ABSTRACT

This report describes our work in developing a linear
predictive speech compression system that transmits high quality
speech at low bit rates. We have developed several methods for
reducing the redundancy in the speech signal without sacrificing
speech quality. Briefly, preemphasis of speech was used to
reduce its spectral dynamic range and thereby improve the
accuracy of parameter quantization. The optimal order of the
linear predictor was adaptively determined for every speech
frame as the lowest value that adequately represents the speech
signal. Among several equivalent sets of predictor parameters
that were investigated, the reflection coefficients were judged
to be the best for use as transmission parameters. An optimal
procedure for quantizing the reflection coefficients was
developed by minimizing the maximum spectral error due to
quantization. The latter criterion was found to yield
synthesized speech with maximum quality for a given average
transmission rate. A scheme was used to transmit speech
parameters at variable rates in accordance with the changing
characteristics of the incoming speech. An information
theoretic method was used to encode the quantized transmission
parameters at significantly lower bit rates, and with absolutely
no effect on speech quality. Finally, with the time-synchronous
method of analysis, improved speech quality was obtained when
synthesis was also done time synchronously. In addition to
these major results, numerous other minor results of relevance
to the stated goal were also obtained. As a combined result of
all these findings, we obtained high quality speech at average
transmission rates as low as 1500 bits per second.
Bolt Beranek and Newman Inc.
SPEECH  COMPRESSION 

I.  INTRODUCTION 

In  order  to  use  the  ARPA  Network  for  voice  transmission,  a 
speech  compression  system  which  achieves  high  quality  speech  at 
low transmission rates is needed. During the past two years our
research  in  various  aspects  of  linear  predictive  speech 
compression  systems  (also  known  as  linear  predictive  vocoders) 
has  yielded  numerous  analytical  and  experimental  results  which 
can  be  applied  to  increasing  both  the  quality  of  the  speech  and 
the  efficiency  of  parameter  transmission. 

We  have  developed  a time-asynchronous  digital  vocoder  in 
which  the  transmission  rate  varies  according  to  the  properties 
of  the  incoming  speech  signal.  The  variable  transmission  rate 
has  a low  upper  bound  as  well  as  a low  average,  an  important 
consideration  for  a real-time  application  such  as  transmission 
over  the  ARPA  Network. 

A.  Summary  of  Major  Results 

The  objective  of  a speech  compression  system  is  to  reduce 
the  redundancy  present  in  the  speech  signal  as  much  as  possible 
while  maintaining  good  quality  in  the  synthesized  speech.  The 
use  of  the  linear  prediction  method  for  modeling  speech  already 
provides  a major  step  towards  meeting  this  objective.  Our 
project  has  been  directed  towards  significant  implementation 


Report  No.  2976 
Volume  II 


aspects  of  the  linear  predictive  vocoder  that  can  result  in 
further  compression  with  limited  distortion  in  speech  quality. 
We  have  collected  statistics  about  the  parameters  of  the  linear 
prediction  model  by  analyzing  speech  utterances  from  male  and 
female  speakers.  These  statistics  were  used  in  the  development 
as  well  as  in  the  implementation  of  several  compression  schemes. 
The  major  results  in  our  project  are  summarized  below  under  the 
headings of quantization, encoding and synthesis. First,
however, we state the guiding principle that we have used in
linking speech quality to transmission rate.

1.  The  Minimax  Principle 

We have developed a systematic and objective design
criterion which, in our experience, leads to synthesized speech
with high quality for a given average transmission rate. The
criterion is to minimize the maximum spectral error in the
synthesized speech. This minimax principle has been used
consistently in our research and is basically responsible for
the high quality output of our low bit rate systems.
2.  Quantization

Quantization is the major source of bit rate reduction. We
distinguish three types of quantization: parameter
quantization, predictor order quantization, and time
quantization. Below is a summary of the results in these areas.

(a)  Parameter  Quantization 

(i)  We found that reducing the spectral dynamic range
of the input speech improves quantization
accuracy, regardless of which set of parameters
is chosen for quantization. We proposed methods
for the reduction of the dynamic range by
preprocessing of the speech signal.

(ii)  From a comparative study of the quantization
properties of a number of parameter sets, we
concluded that the reflection coefficients are
the best set for use as transmission parameters.

(iii)  We determined an optimal method for the
quantization of the reflection coefficients by
making use of the minimax principle. The optimal
procedure consists of first transforming the
reflection coefficients to log area ratios and
then linearly quantizing these transformed
parameters.

(iv)  An optimal solution was also derived for the
problem of allocating a fixed number of bits
among the parameters.
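
The procedure in item (iii) can be sketched in a few lines. The sketch below is illustrative only: it assumes the common definition of the log area ratio, g = log((1 + k)/(1 - k)), whose sign convention varies between authors, and the quantizer range [-4, 4], the 5-bit resolution and all function names are hypothetical choices, not values taken from this report.

```python
import math

def reflection_to_lar(k):
    """Log area ratio of one reflection coefficient, |k| < 1 (assumed convention)."""
    return math.log((1.0 + k) / (1.0 - k))

def lar_to_reflection(g):
    """Inverse mapping: tanh(g / 2) always lies strictly inside (-1, 1)."""
    return math.tanh(g / 2.0)

def quantize_uniform(value, lo, hi, bits):
    """Index of the equal-width cell on [lo, hi] that contains value."""
    levels = 2 ** bits
    step = (hi - lo) / levels
    index = int((value - lo) / step)
    return min(max(index, 0), levels - 1)   # clamp out-of-range inputs

def dequantize_uniform(index, lo, hi, bits):
    """Midpoint of the cell with the given index."""
    step = (hi - lo) / (2 ** bits)
    return lo + (index + 0.5) * step

def code_reflection(k, lo=-4.0, hi=4.0, bits=5):
    """Quantize k linearly in the log area ratio domain.

    Returns (transmitted integer, reflection coefficient seen by the decoder).
    The range and bit count are hypothetical defaults.
    """
    index = quantize_uniform(reflection_to_lar(k), lo, hi, bits)
    return index, lar_to_reflection(dequantize_uniform(index, lo, hi, bits))
```

Because the decoder maps back through tanh, every decoded coefficient lies inside (-1, 1), so a synthesis filter built from the decoded values remains stable whatever the quantization error.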

(b)  Variable  Order  Predictor  (Order  Quantization) 

We have found that different speech sounds can be
represented adequately by different order linear predictors.
Thus, rather than sending a maximum number of parameters for
every frame, one can minimize the bit rate by sending the
minimum number of parameters that adequately represent that
frame. We have introduced an information theoretic criterion
that allows us to determine the optimal order for each analysis
frame.


(c)  Variable  Frame  Rate  (Time  Quantization) 

In deciding how often to transmit parameters, the
application of the minimax principle leads to the obvious result
that one should transmit more often when the speech spectrum is
changing rapidly and less often when the spectrum is changing
slowly. In using this transmission scheme, we have employed an
effective criterion to measure spectral changes. We have found
that, for a given average bit rate, variable frame rate
transmission produces speech of distinctly better quality than
fixed frame rate transmission.
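
As a rough illustration of the idea, the sketch below decides which constant-rate analysis frames to transmit. The measure of spectral change used in our system is not given in this summary, so the sketch substitutes a simple stand-in: the largest change in any one parameter since the last transmitted frame. That measure, the threshold and the function name are all assumptions made for the example.

```python
def select_frames(frames, threshold):
    """Return the indices of the analysis frames that would be transmitted.

    frames    -- list of equal-length parameter vectors, one per analysis frame
    threshold -- hypothetical limit on the change since the last transmission
    """
    if not frames:
        return []
    sent = [0]                      # the first frame is always transmitted
    last = frames[0]
    for i in range(1, len(frames)):
        # Stand-in change measure: largest change in any single parameter
        # relative to the most recently transmitted frame.
        change = max(abs(a - b) for a, b in zip(frames[i], last))
        if change > threshold:
            sent.append(i)
            last = frames[i]
    return sent
```

With slowly varying parameters most frames are skipped, while a rapid transition triggers a transmission immediately, which is the behavior described above.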
3.  Encoding

We have collected appropriate statistics on the
distribution of quantized values of the transmission parameters.
These statistics were used to develop variable length bit
encoding techniques that result in considerable bit savings (on
the order of 20%). These are information theoretic techniques
which have absolutely no effect on speech quality.

4.  Synthesis

Proper methodology in synthesis is crucial in producing
high quality speech. We have found that time-synchronous
updating of linear prediction parameters at the synthesizer
yields better speech quality than pitch-synchronous updating if
the analysis is performed time synchronously. This method has
the additional advantage of simplifying the necessary
computations.

Using the results summarized above and others discussed in
this report, we were able to demonstrate good quality speech at
average rates of 1500 bps (bits/sec) or less. We consider this
a significant step towards our goal of developing a high
quality, low bit-rate linear predictive vocoder.
B.  Outline of Report

In  Section  II,  the  various  components  of  a speech 
compression  system  are  introduced.  A brief  introduction  to 
linear  prediction  of  speech  is  also  given. 

In  Section  III,  we  discuss  the  extraction  of  predictor 
parameters,  pitch,  and  gain.  In  conjunction  with  this 
discussion  two  comparisons  are  made:  the  autocorrelation  versus 
the  covariance  method,  and  time-synchronous  versus 
pitch-synchronous  analysis. 

The  variable  order  linear  prediction  method  is  outlined  in 
Section  IV.  The  reason  for  varying  the  predictor  order  is  given 
first,  followed  by  a discussion  of  an  information  theoretic 
criterion  to  determine  the  "optimal"  order  for  any  analysis 
frame.

In  Section  V,  two  methods  are  given  for  preprocessing  of 
speech  which  reduce  its  short-time  spectral  dynamic  range  and 
hence  improve  parameter  quantization  accuracy.  Logarithmic 
quantization  of  pitch  and  gain  is  dealt  with  next.  The 
remainder  of  the  section  presents  the  results  of  a comparative 
study  of  the  quantization  properties  of  several  alternate  sets 
of  parameters  which  uniquely  characterize  the  linear  predictor. 
Specifically,  it  is  concluded  that  the  reflection  coefficients 
are  to  be  preferred  over  all  other  sets  of  parameters  for 
purposes  of  quantization. 
The problem of optimal quantization of the reflection
coefficients  is  considered  in  Section  VI.  It  is  argued  first 
that  for  good  speech  quality  it  is  necessary  to  minimize  the 
maximum  error  in  the  spectrum  of  the  linear  predictor  due  to 
parameter  quantization.  This  naturally  leads  to  the 
investigation  of  the  sensitivity  of  the  spectrum  to  changes  in 
the  values  of  the  reflection  coefficients.  Using  the  minimax 
error  criterion  and  the  results  of  the  sensitivity  analysis,  it 
is  shown  that  the  optimal  quantization  method  consists  of 
transforming  the  reflection  coefficients  to  log  area  ratios  and 
linearly  quantizing  the  latter.  An  optimal  bit  allocation 
strategy  for  the  transmission  parameters  is  also  presented.  Use 
of  an  alternate  sensitivity  analysis  of  the  reflection 
coefficients  is  then  investigated. 

Variable frame rate transmission is discussed in Section
VII as a means of significantly lowering the bit rate while
maintaining high speech quality.

In Section VIII, we present an information theoretic
procedure, known as Huffman coding, which uses the statistics of
the quantized values of a parameter to transmit them more
efficiently with a variable number of bits and, very
importantly, without introducing any additional error. This
encoding method offers considerable savings in bit rates.
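
A minimal sketch of Huffman coding follows, to make the idea concrete. The function names and the histogram in the usage note are hypothetical; the construction itself is the standard one, repeatedly merging the two least frequent subtrees.

```python
import heapq

def huffman_code(freqs):
    """Build a prefix code from {symbol: count}; returns {symbol: bit string}."""
    if len(freqs) == 1:                       # degenerate single-symbol source
        return {next(iter(freqs)): "0"}
    # Heap entries: (total count, tie-breaker, {symbol: partial code}).
    # The integer tie-breaker keeps the dictionaries from ever being compared.
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c0, _, low = heapq.heappop(heap)      # the two least frequent subtrees
        c1, _, high = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in low.items()}
        merged.update({s: "1" + code for s, code in high.items()})
        heapq.heappush(heap, (c0 + c1, tie, merged))
        tie += 1
    return heap[0][2]

def average_bits(freqs, code):
    """Average code length in bits per symbol under the given counts."""
    total = sum(freqs.values())
    return sum(freqs[s] * len(code[s]) for s in freqs) / total
```

For example, a made-up histogram such as `{0: 1, 1: 2, 2: 5, 3: 8, 4: 8, 5: 5, 6: 2, 7: 1}` for a 3-bit quantizer yields an average below the 3 bits that direct binary encoding would need, and the decoder recovers every quantized value exactly.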
Issues relating to the synthesizer are discussed in Section
IX.  An  important  result  presented  in  this  section  is  that 
time-synchronous  synthesis  produces  better  quality  speech  than 
pitch-synchronous  synthesis. 

Section  X briefly  narrates  the  software  simulation  of  the 
entire  speech  compression  system  on  our  time-sharing  computer 
facility.  Also  included  in  this  section  are  the  typical  average 
transmission rates encountered with the use of one or many of
the  bit-saving  techniques  presented  in  the  earlier  sections. 

Section  XI  summarizes  our  work  completed  thus  far  towards 
implementing  the  speech  compression  system  in  real  time  in 
cooperation with the other sites in the ARPA community.

A few  other  topics  that  we  have  also  worked  on  during  this 
project are considered in Section XII. These include the status of
our research in developing measures for objective speech quality
evaluation, a new method of testing the performance of the
vocoder  at  different  sampling  frequencies  without  actually 
sampling  at  all  those  rates,  and  our  experience  with  the  two 
techniques:  formant  bandwidth  correction  before  synthesis  and 
parameter  smoothing. 
II.  SPEECH COMPRESSION USING LINEAR PREDICTION

A.  Components of a Speech Compression System

Figure 1 shows the various components of a speech
compression system. The first component analyzes the speech
signal s(t) that has been low-pass filtered and time-sampled,
and extracts a vector of unquantized parameters x(t). These
parameters are then quantized and encoded in the encoder as y(t)
and are transmitted through the transmission channel. In a
noiseless channel y'(t) = y(t). This is generally the case for the
ARPA Network. The parameters y'(t) are decoded in the decoder
to produce an estimate x'(t) of the analysis parameters x(t).
The last component in Fig. 1 uses the parameters x'(t) to
synthesize the signal s'(t) which is an approximation to the
original signal s(t). A compression system attempts to minimize
the number of bits/second in y(t) while maintaining good quality
in the synthesized speech. The nature of the synthesizer
dictates the type of analysis to be performed. Fig. 2 depicts
the two major components of the synthesizer: excitation and
transfer function. In our project, we have done work on each of
the different parts of the vocoder shown in Fig. 1.

Once a speech model is chosen (the linear prediction model
in our case), any reduction in transmission rate is accomplished
by the encoder. The encoder in Fig. 1 performs the two
functions, quantization and encoding. The quantization process
converts the extracted parameters into a set of integers
using specified quantization schemes. The encoding process
encodes these integers into a sequence of binary digits for
transmission. The encoding can be as simple as direct binary
encoding, or as complicated as desired for the minimization of
the average transmission rate.

Fig. 2.  Major components of a speech synthesizer.

The  linear  prediction  method  models  speech  by  a time 
varying  all-pole  filter.  The  filter  parameters  are  assumed  to 
vary slowly enough so that they can be considered constant over
an  analysis  frame,  usually  10-20  msec  long.  Next,  we  briefly 
review  the  linear  prediction  method  and  provide  necessary 
background  for  later  sections. 


B.  Linear  Prediction  of  the  Speech  Signal 

In linear prediction, speech is modeled by an all-pole
filter

$$H(z) = \frac{G}{1 + \sum_{k=1}^{p} a_k z^{-k}} \tag{1}$$



as shown in Fig. 3. The parameters $a_k$, $1 \le k \le p$, are known as the
predictor coefficients, and G is the filter gain. The input to
the filter is either a sequence of pulses separated by the pitch
period for voiced sounds, or white noise for fricated (or
unvoiced) sounds.

Fig. 3.  Discrete model of speech production as employed in
linear prediction: (a) frequency-domain model, (b) time-domain
model.

For a particular speech segment the filter
parameters are obtained by passing the sampled speech
signal through the inverse filter

$$A(z) = 1 + \sum_{k=1}^{p} a_k z^{-k} \tag{2}$$

as in Fig. 4, and then minimizing the total-squared prediction
error

$$E = \sum_{n} e_n^2 = \sum_{n} \Bigl( s_n + \sum_{k=1}^{p} a_k s_{n-k} \Bigr)^2 \tag{3}$$

with respect to $a_k$. Depending on the range over which the
summation in (3) applies and the definition of the signal $s_n$ in
that range, we have the two linear predictive methods of
analysis: the covariance method and the autocorrelation method.
For the covariance method, the signal is defined over a finite
range, $-p \le n \le N-1$, and the minimization of E leads to the
following normal equations [1,2]:

$$\sum_{k=1}^{p} a_k c_{ik} = -c_{i0}, \quad 1 \le i \le p, \tag{4}$$

where

$$c_{ik} = \sum_{n=0}^{N-1} s_{n-i} s_{n-k}. \tag{5}$$

The p x p coefficient matrix $[c_{ik}]$ of the system of equations in
(4) is symmetric and positive definite, and is called the
covariance matrix. The covariance normal equations can be
solved by an efficient triangularization method [3]. Using (4)
and (3), the minimum prediction error is given by [2]

$$E_p = c_{00} + \sum_{k=1}^{p} a_k c_{0k}. \tag{6}$$

Fig. 4.  The error sequence as the output of an inverse
filter A(z).
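
The covariance method can be illustrated with a short sketch. It assumes the standard forms of the normal equations, c_ik = sum over n = 0..N-1 of s(n-i) s(n-k), sum over k of a_k c_ik = -c_i0, and E_p = c_00 + sum over k of a_k c_0k. It solves the system by plain Gaussian elimination with pivoting rather than the triangularization method of [3], purely to stay short. The function name and data layout are hypothetical.

```python
def covariance_lpc(samples, p):
    """Covariance-method predictor for samples s[-p], ..., s[N-1].

    samples[0] holds s[-p].  Returns (a, E) with a[k-1] = a_k and E the
    minimum total-squared error E_p.
    """
    N = len(samples) - p

    def s(n):
        return samples[n + p]                 # shift so that n may be negative

    # Covariance values c[i][k] for 0 <= i, k <= p.
    c = [[sum(s(n - i) * s(n - k) for n in range(N)) for k in range(p + 1)]
         for i in range(p + 1)]
    # Augmented normal equations: rows i = 1..p, unknowns a_1..a_p.
    m = [[c[i][k] for k in range(1, p + 1)] + [-c[i][0]] for i in range(1, p + 1)]
    for col in range(p):                      # Gaussian elimination with pivoting
        pivot = max(range(col, p), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, p):
            f = m[r][col] / m[col][col]
            for j in range(col, p + 1):
                m[r][j] -= f * m[col][j]
    a = [0.0] * p
    for i in range(p - 1, -1, -1):            # back substitution
        a[i] = (m[i][p] - sum(m[i][j] * a[j] for j in range(i + 1, p))) / m[i][i]
    E = c[0][0] + sum(a[k - 1] * c[0][k] for k in range(1, p + 1))
    return a, E
```

A signal generated by an exact second-order recursion is predicted perfectly by this method, so the recovered coefficients match the recursion and the minimum error is zero to within rounding. The sketch makes no check on the stability of the resulting filter.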
For the autocorrelation method, the signal $s_n$ is assumed to
be defined over the infinite interval, $-\infty < n < \infty$. Usually, the
signal is multiplied by a finite window (e.g., Hamming) so that
$s_n = 0$ for $n < 0$ and $n > N-1$. The normal equations for the
autocorrelation method can be shown to be [2,4,6,7]

$$\sum_{k=1}^{p} a_k R_{|i-k|} = -R_i, \quad 1 \le i \le p, \tag{7}$$

where

$$R_i = \sum_{n=0}^{N-1-|i|} s_n s_{n+|i|} \tag{8}$$

is the autocorrelation function of the windowed signal $s_n$. The
autocorrelation matrix $[R_{|i-k|}]$ is symmetric, positive definite
and Toeplitz (the values along any diagonal are equal).
Levinson's method can be used to recursively solve the
autocorrelation equations [2,4,5]. The minimum prediction error
is obtained by substituting (7) in (3) as

$$E_p = R_0 + \sum_{k=1}^{p} a_k R_k. \tag{9}$$
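
The autocorrelation method with Levinson's recursion can be sketched as follows. The sketch assumes the sign convention A(z) = 1 + sum over k of a_k z^-k used in this section, so that the normal equations read sum over k of a_k R_|i-k| = -R_i. The function names are hypothetical, and the recursion shown is the textbook form rather than a transcription of our simulation code.

```python
import math

def hamming(N):
    """Symmetric Hamming window of length N (N >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]

def autocorrelation(s, p):
    """R_0..R_p for a windowed signal that is zero outside 0..N-1."""
    N = len(s)
    return [sum(s[n] * s[n + i] for n in range(N - i)) for i in range(p + 1)]

def levinson(R, p):
    """Solve the autocorrelation normal equations recursively.

    Returns (a, k, E): a[j-1] = a_j, k[i-1] = k_i (reflection coefficients),
    and E = E_p, the minimum prediction error.
    """
    a = []
    k = []
    E = R[0]                                   # E_0 = R_0
    for i in range(1, p + 1):
        acc = R[i] + sum(a[j - 1] * R[i - j] for j in range(1, i))
        ki = -acc / E
        # Order update: a_j <- a_j + k_i * a_(i-j), then append a_i = k_i.
        a = [a[j - 1] + ki * a[i - j - 1] for j in range(1, i)] + [ki]
        k.append(ki)
        E *= (1.0 - ki * ki)                   # E_i = (1 - k_i^2) E_(i-1)
    return a, k, E
```

The recursion produces the reflection coefficients as a by-product, and the error after each stage is the previous error scaled by (1 - k_i^2), so the normalized error of any lower order predictor is available without solving the equations again.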


When applying Levinson's recursive method to solve (7), we
also obtain the auxiliary quantities, $k_i$, $1 \le i \le p$, which are
called the reflection coefficients [10.23] (or partial
correlation coefficients [8,9]). Reflection coefficients occur
naturally in the treatment of the vocal tract as a lossless
acoustic tube with p sections, each with a different
cross-sectional area [1,10]. The filter H(z) in (1) is stable
(i.e., poles of H(z) lie inside the unit circle in the z plane)
if and only if

$$-1 < k_i < 1, \quad 1 \le i \le p. \tag{10}$$

Of interest also is the normalized error $V_p$ which is the
ratio of the minimum error to the energy of the input speech
signal, i.e.

$$V_p = E_p / R_0. \tag{11}$$

For the autocorrelation method, $V_p$ can be expressed in terms of
the reflection coefficients as [2]

$$V_p = \prod_{i=1}^{p} (1 - k_i^2). \tag{12}$$


There exist two methods of synthesizing speech using the
analysis parameters. First, the prediction error signal $e_n$ (or
a simple transformation thereof) is used as input to an all-pole
filter 1/A(z) to produce the synthesized speech. A vocoder
using this synthesis approach is called a residual excited
vocoder [11,12]. As this vocoder needs the transmission of the
error signal at the speech sampling rate, the transmission rate
for acceptable speech quality is relatively high (on the order
of 10,000 bps). As our goal was to develop a low bit-rate
vocoder (less than 2000 bps), we did not consider the residual
excited vocoder in our research.

The second method of synthesis uses the pulse/noise
excitation as input to the all-pole filter H(z) (see Fig. 3). A
vocoder using this synthesis approach is called a pitch excited
vocoder. For each analysis frame, a decision has to be made as
to whether that frame is voiced or unvoiced, and if voiced, the
value of the pitch period has to be determined. In addition to
the voicing information, the gain G of the filter H(z) has to be
determined. By equating the energy of the synthesized speech to
the energy of the original speech, G can be shown to be related
to the minimum prediction error by [7,8,14]

$$G^2 = E_p = R_0 V_p = R_0 \prod_{i=1}^{p} (1 - k_i^2). \tag{13}$$

For the pitch excited vocoder, the transmitter sends the voicing
information and the gain at the same low rate as the filter
parameters. As a result, transmission rates of about 2000 bps
produce acceptable speech quality. In our research, we worked
exclusively with the pitch excited vocoder. However, all the
results stated in this report that pertain to the transmission
of filter parameters also hold for the residual excited vocoder.
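
A minimal sketch of pitch excited synthesis is given below. It implements the difference equation implied by the all-pole filter (1), s_n = G u_n - sum over k of a_k s_(n-k), with a unit pulse train for voiced frames and Gaussian white noise for unvoiced frames. The function names, the unit pulse amplitude and the noise generator are assumptions for illustration; gain normalization between the two excitation types is not addressed.

```python
import random

def excitation(num_samples, pitch_period=None, seed=0):
    """Unit pulses every pitch_period samples if voiced, else white noise."""
    if pitch_period:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    rng = random.Random(seed)                  # seeded so the sketch is repeatable
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]

def synthesize(a, G, u):
    """Run the excitation u through H(z) = G / (1 + sum_k a_k z^-k)."""
    p = len(a)
    out = []
    for n, un in enumerate(u):
        # Weighted sum of the previous p output samples (zero before n = 0).
        past = sum(a[k - 1] * out[n - k] for k in range(1, p + 1) if n - k >= 0)
        out.append(G * un - past)
    return out
```

For a single-pole filter with a_1 = -0.5 and G = 2, a pulse train of period 4 produces the decaying response 2, 1, 0.5, 0.25 followed by a fresh pulse added to the tail of the previous one.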
III.  ANALYSIS

The analyzer in Fig. 1 extracts from the speech signal the
excitation and the transfer function parameters that are used
later for synthesis. We shall first discuss the extraction of
these parameters and then the timing of such extraction.

A.  Parameter Extraction

For  the  types  of  synthesizer  implementation  we  have 
considered,  the  transfer  function  parameters  can  be  computed 
directly  from  the  linear  prediction  coefficients.  A comparison 
of  the  two  methods  of  linear  prediction  indicates  that  the 
predictor  coefficients  obtained  in  the  autocorrelation  method 
are  guaranteed  to  produce  a stable  filter  of  the  form  given  in 
(1)  [15,16],  while  stability  cannot  be  guaranteed  in  the 
covariance  method  [1].  Computationally,  the  autocorrelation 
equations  (7)  can  be  solved  faster  than  the  covariance  equations 
(4).  Also,  the  autocorrelation  method  offers  many  useful 
spectral  interpretations  [7].  The  covariance  method  may, 
however,  produce  a better  representation  of  the  speech  signal. 
In  our  experience,  the  possible  improvement  in  speech  quality 
produced  by  the  covariance  method  was  not  commensurate  with  the 
extra  computational  cost  in  solving  (4)  and  in  coping  with 
instability problems. Consequently, in all our work reported
below, we used only the autocorrelation method.
From a study of some of the available pitch extraction
schemes,  a modified  version  of  the  method  of  center-clipping 
[17]  was  chosen  for  use  in  our  experimental  speech  compression 
system.  The  reliability  of  the  basic  scheme  was  improved  by 
using  several  additional  decision  parameters  such  as  the 
normalized  error  associated  with  the  linear  predictor,  zero 
crossing  rate  of  the  clipped  signal  and  frame-to-frame  energy 
change  in  the  speech  signal.  Furthermore,  to  yield  accurate 
pitch  estimation  over  a wide  range  of  frequencies,  the  width  of 
the  time  window  chosen  for  pitch  analysis  was  made  variable;  it 
was  changed  from  30  msec  to  50  msec  whenever  pitch  frequency 
fell  below  100  Hz.  Equipped  with  these  features,  the 
center-clipping algorithm was found to yield pitch data which
compared  quite  favorably  with  those  derived  manually  from  the 
time  signals. 

For  computing  the  gain  of  the  linear  prediction  filter  at 
the  synthesizer,  we  also  computed  and  transmitted  energy  per 
sample  of  the  input  speech  signal.  Clearly,  the  energy  of  the 
Hamming-windowed  speech  signal  is  less  than  the  energy  of  the 
unwindowed  signal.  From  both  analytical  and  experimental 
considerations,  we  found  that  multiplying  the  energy  of  the 
windowed  speech  signal  by  a factor  of  2.5  provided  a loudness 
level  for  the  synthesized  speech  that  was  about  the  same  as  the 
original  input  speech. 
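
One analytical reading of the factor 2.5, offered here as an assumption rather than as the derivation we used, is that it compensates for the energy removed by the window: the mean squared value of a Hamming window is close to 0.4, and its reciprocal is close to 2.5. The sketch below checks this numerically for a 200-sample window (20 msec at 10 kHz, a hypothetical frame length).

```python
import math

def hamming(N):
    """Symmetric Hamming window of length N (N >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]

def window_energy_ratio(window):
    """Energy under the window relative to a rectangular window of equal length."""
    return sum(w * w for w in window) / len(window)

ratio = window_energy_ratio(hamming(200))     # close to 0.4
correction = 1.0 / ratio                      # close to 2.5
```

The same ratio gives the effective width of the window relative to a rectangular one.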
B.  Timing of Parameter Extraction

There  are  two  considerations  in  the  timing  of  parameter 
extraction.  The  first  deals  with  frame  positioning  with  respect 
to  the  pitch  period.  Although  it  is  generally  agreed  that 
pitch-synchronous analysis is desirable in terms of the quality
of  the  synthesized  speech,  it  is  also  clear  that  such  analysis 
can  be  quite  involved  in  terms  of  complexity  of  computation  and 
decision  making.  In  our  experiments  with  pitch-synchronous 
analysis  we  found  that  the  resulting  improvement  in  speech 
quality  was  only  minimal  and  not  commensurate  with  the  added 
complexity. We have therefore used pitch-asynchronous analysis
exclusively.


The  second  consideration  deals  with  the  rate  of  parameter 
extraction.  In  all  our  investigations,  parameter  extraction  was 
done  at  a constant  frame  rate.  However,  we  studied  both 
constant  frame  rate  and  variable  frame  rate  transmission  of 
parameters.  In  Section  VII,  where  we  discuss  variable  frame 
rate  transmission,  we  give  criteria  to  decide  when  to  transmit 
based  on  the  parameter  data  extracted  at  a constant  frame  rate. 
IV.  VARIABLE ORDER LINEAR PREDICTION

If  a fixed  order  linear  prediction  is  used  for  speech  that 
is  sampled  at  a Nyquist  rate  of  10  kHz,  then  an  order  p=12  has 
been  found  satisfactory  for  modeling  of  all  the  speech  sounds. 
However,  we  found  that  for  some  sounds  (especially  unvoiced 
sounds),  good  spectral  representation  was  obtained  using  a 
considerably  lower  order  linear  predictor.  In  fact,  it  is 
possible to adaptively vary the order p of the predictor in
accordance with the properties of the speech sounds being
analyzed. The purpose of using variable order linear prediction
is  to  lower  the  transmission  rate  by  transmitting  on  the  average 
a smaller  number  of  coefficients,  without  causing  any 
perceptible  change  in  speech  quality. 

It  is  desirable  to  have  a criterion  to  determine  the 
"optimal"  (minimum)  predictor  order  that  gives  an  adequate 
spectral  representation  of  each  speech  sound.  The  criterion 
should  strike  a compromise  between  the  number  of  coefficients 
used  and  the  modeling  accuracy  obtained.  We  have  found  an 
information  theoretic  criterion  due  to  Akaike  to  be  particularly 
suitable  for  this  purpose  [18]. 

Akaike  has  stated  the  modeling  problem  as  an  estimation 
problem  with  an  associated  error  measure.  For  the  maximum 
likelihood  estimation  method,  he  has  shown  that  an  estimate  of 
the  mean  log-likelihood  reduces  to  an  information  theoretic 
measure. When the estimates of model parameters are close to
their true values, this measure can be simplified as [19]:

$$I(p) = -2 \log(\text{maximum likelihood function}) + 2p. \tag{14}$$

The value of p for which I(p) is a minimum is taken to be the
optimal order. The first term in (14) relates to modeling or
estimation error, while the second term represents the model
complexity. Hence, the criterion in (14) reflects a
mathematical formulation of the principle of parsimony in model
building. In our problem of all-pole modeling, if we assume
that the speech signal has a Gaussian probability distribution,
then (14) reduces to (neglecting additive constants and dividing
by $N_e$)

$$I(p) = \log V_p + \frac{2p}{N_e}, \tag{15}$$

where $V_p$ is the normalized error given in (11) and $N_e$ is the
"effective" number of samples in the analysis frame. The word
"effective" is used to indicate that one must compensate for
possible windowing. The effective width of a window can be
taken as the energy under the window relative to that of a
rectangular window. For example, we use a Hamming window for
which $N_e = 0.4N$, where N is the number of samples in the analysis
frame.

Note that the first term in (15) decreases as a function of
p, and the second term increases. Therefore, a minimum can
occur. In practice, there are usually several local minima;
then the value of p corresponding to the absolute minimum of
I(p) is taken as the optimal value. Usually I(p) is computed up
to the maximum value of interest, and the minimum of I(p) is
found in that region. We used a maximum value of p=13 for
speech preemphasized with a single-zero filter. (For a
discussion on preemphasis, see Section V-A.)

A property of the reflection coefficients, $k_i$, $1 \le i \le p$, is
that the values of $k_i$, $i < p$, do not change as p is varied. So,
when applying Akaike's criterion, we need only to compute the
reflection coefficients for the maximum order predictor. For
any pth order predictor, where p is less than the maximum value
used, I(p) is computed from (15) with $V_p$ obtained using the
first p reflection coefficients in (12).
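
The order search can be sketched as follows. The sketch takes the reduced criterion to be I(p) = log V_p + 2p/N_e with V_p the running product of (1 - k_i^2) from (12); this form of (15) is stated here as an assumption of the sketch, and the function name and the example coefficients in the usage note are hypothetical.

```python
import math

def optimal_order(k, Ne):
    """Return (best order, [I(1), ..., I(p_max)]) from reflection coefficients.

    k  -- reflection coefficients k_1..k_pmax of the maximum order predictor
    Ne -- effective number of samples in the analysis frame
    """
    V = 1.0                                    # normalized error before any stage
    scores = []
    for p, kp in enumerate(k, start=1):
        V *= (1.0 - kp * kp)                   # V_p from the first p coefficients
        scores.append(math.log(V) + 2.0 * p / Ne)
    # Absolute minimum over all orders; ties go to the lowest order.
    best = 1 + min(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

For instance, with the made-up coefficients `[-0.9, 0.6, -0.4, 0.05, 0.02, 0.01]` and `Ne = 80`, the last three stages reduce the error by less than the penalty they incur, so the search settles on order 3. Multiplying each score by 10 log10(e) expresses the criterion in decibels.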

Figure 5 shows an example of the application of Akaike's
criterion to a voiced sound. The dashed curve is a plot of the
normalized error which decreases monotonically with increasing
p. The solid curve is a plot of I(p) in (15) multiplied by
10 $\log_{10} e$ to obtain the results in decibels. In Fig. 5 the optimal
predictor order is p=10. Note that I(p) for p>10 slopes upward,
but very gently. This indicates that the actual absolute
minimum is quite sensitive to the linear term in (15).
Application of Akaike's criterion to a fricative sound is
illustrated in Fig. 6. The optimal order for this case is 3 as
shown at the top of Fig. 6. The bottom plot in Fig. 6 shows that
the spectrum of the third order predictor (smooth plot) is a
reasonable approximation to the envelope of the power spectrum
of the corresponding speech signal (ragged plot).

In practice, the criterion in (15) should not be regarded
as an absolute, because it is based on several assumptions which
might not apply for the signal of interest. For example, the
Gaussian assumption might not hold. Therefore, the experimenter
should feel free to adjust the criterion to suit one's
application. One simple way of "tuning" the criterion is to
multiply by an appropriate factor. However, we have found
that for speech compression applications the criterion (15) is
by itself adequate without modification.


We collected statistics by applying the criterion (15) to
preemphasized speech using a data base of several utterances
from male and female speakers. The resulting histograms of the
optimal order are given separately for voiced and unvoiced
sounds in Fig. 7. In the plots, the ordinate at a given number
of poles gives the probability of that number being chosen as
the optimal order. From these probabilities, we found that the
average value of the optimal order is 9.6 for voiced sounds and
5.2 for unvoiced sounds.


When applying variable order linear prediction to speech
compression, we must also transmit a code indicating the number
of poles used for every transmitted data frame. For the case
when the maximum order is 13, and with the use of the Huffman
coding procedure described in Section VIII, we can transmit this
information with about 3*1/ bits/frame on the average. In our
experiments, we found that the average savings in transmission
rates with the use of variable order linear prediction ranged
between 10 to 1*3%. Informal listening tests on the synthesized
speech did not indicate any perceptible difference between the
fixed order and the variable order cases.

V. CHOICE OF PARAMETERS FOR QUANTIZATION

Proper  choice  of  transmission  parameters  is  important  for 
reducing  the  bit  rate  while  maintaining  good  quality  speech  at 
the  synthesizer.  First,  we  describe  two  methods  of 
preprocessing  of  speech  which  we  have  used  to  improve  the 
quantization  properties  of  filter  parameters.  Next,  as  the 
quantization  properties  of  pitch  and  gain  are  well  understood, 
we  discuss  their  quantization  briefly.  Finally,  we  report  the 
results  of  a comparative  study  of  several  alternate  sets  of 
parameters  representing  the  linear  predictor.  Specifically,  we 
conclude  that  the  reflection  coefficients  constitute  the  best 
set  as  transmission  parameters. 

A.  Preprocessing  of  Speech 

In  our  experiments,  we  observed  that  the  short-time 
spectral  dynamic  range  is  the  single  most  important  factor  that 
affects  the  quantization  properties  of  transmission  parameters. 
We  define  the  spectral  dynamic  range  to  be  the  difference  in 
decibels between the maximum and minimum spectral values within
the frequency range of interest. The spectral dynamic range in
turn is controlled by two somewhat related quantities, namely,
the overall slope of the spectrum and the bandwidths of the
poles and zeros. A large spectral slope or some narrow
bandwidth poles result in a high dynamic range. We investigated
two methods of preprocessing of the speech signal to reduce the
spectral dynamic range and hence to improve the quantization
properties of transmission parameters. In the first method,
called preemphasis, the speech signal is passed through an
all-zero filter to alter its spectral slope. The second method,
which we call the bandwidth expansion method, reduces the dynamic
range by increasing pole bandwidths.

1. Preemphasis

To explain the concept, consider the first-order
preemphasis illustrated in Fig. 8. The speech signal is passed
through a filter with a simple real zero at z=b. From the
amplitude responses shown, it is clear that when b is greater
than zero, the high frequency components of the spectrum are
emphasized, while when b is less than zero, the low frequency
components are emphasized. The magnitude of b in both cases
determines the extent of the emphasis. It is desirable to be
able to find for a given speech segment a value for b that is
optimal in some sense. Since the purpose here is to reduce the
spectral slope as much as possible, it makes sense to estimate
the overall spectral slope of the speech by a single pole filter
and use its inverse for preemphasis. Since linear prediction
gives an optimal all-pole approximation to the speech spectrum
[2,7], it is clear that we can use linear prediction to
determine the optimal value for b. For the first order case this
optimal value can be expressed explicitly in terms of the
autocorrelation of the speech signal:

    b = R_1 / R_0 .

It is natural to ask whether higher order preemphasis would
be more desirable. We have specifically investigated optimal
second-order preemphasis. It clearly leads to a further
reduction in dynamic range. However, when deemphasis (or
postemphasis) is performed, distortions are introduced into the
spectrum. This is illustrated in Fig. 9. At the top, we have
the linear prediction spectrum after preemphasis. The solid
curve represents the case where the parameters are unquantized,
and the dashed curve represents the quantized case. (We used
the reflection coefficients for quantization.) For the
particular spectrum shown, the distortion due to quantization is
small. The additional distortion attributable to second-order
preemphasis is shown in the bottom plot, where the solid curve
is the linear prediction spectrum with no preemphasis, and the
dashed curve is the corresponding spectrum after preemphasis and
deemphasis. In general, the first formant is affected most for
voiced sounds; its frequency is often lowered, thereby
producing a nasal-like quality or enhanced low-frequency buzz in
the synthesized speech. It should be mentioned that such
distortions occur even without any quantization. A primary
reason for this phenomenon is that the second-order preemphasis
often flattens the spectrum by eliminating the prominent formant
in the speech. Upon postemphasis, this formant is not restored
exactly. This was highlighted in our experiment where we
observed that using an optimal second-order preemphasis filter
in cascade with a suboptimal 10th order linear prediction filter
(total order 12), produced a speech quality inferior to that of
a 10th order optimal linear prediction filter with no
preemphasis on the speech signal.

Fig. 9. An example illustrating the effects of
using the optimal second-order preemphasis.

With  optimal  first-order  preemphasis,  only  one  real  pole  is 
removed  from  the  input  speech  spectrum.  This  does  not  result  in 
any  perceivable  distortion  in  the  synthesized  speech.  In 
Fig.  10,  the  results  with  first-order  preemphasis  for  the  same 
example  as  above  are  shown.  As  can  be  seen,  the  spectra  remain 
close  even  after  deemphasis.  Hence,  we  do  not  recommend  the  use 
of second-order optimal preemphasis in speech compression
systems.
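
The first-order optimal preemphasis described above can be
sketched as follows. This is an illustration under stated
assumptions, not the implementation used in our experiments: the
filter is taken to be 1 - b z^{-1} (a single real zero at z=b),
b is taken as the ratio of the lag-1 to the lag-0
autocorrelation of the frame, and the function names are
hypothetical.

```python
def autocorr(x, lag):
    """Short-time autocorrelation of the frame x at the given lag."""
    return sum(x[n] * x[n + lag] for n in range(len(x) - lag))

def optimal_b(x):
    """First-order linear prediction: b = R_1 / R_0 (0 for a silent frame)."""
    r0 = autocorr(x, 0)
    return autocorr(x, 1) / r0 if r0 > 0.0 else 0.0

def preemphasize(x, b):
    """All-zero filter with a zero at z = b: y[n] = x[n] - b * x[n-1]."""
    y, prev = [], 0.0
    for s in x:
        y.append(s - b * prev)
        prev = s
    return y

def deemphasize(y, b):
    """Inverse (single-pole) filter: x[n] = y[n] + b * x[n-1]."""
    x, prev = [], 0.0
    for s in y:
        prev = s + b * prev
        x.append(prev)
    return x
```

For a frame whose spectrum is dominated by one real pole, for
example x[n] = 0.9^n, optimal_b returns a value very close to
0.9, and deemphasize(preemphasize(x, b), b) returns the original
frame to within rounding error.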

In continuous speech, the short-time spectrum changes with
time, thus requiring different preemphasis filters, which must
be encoded in some manner before transmission. We found that
either 1 or 2 bit encoding of preemphasis data was sufficient.
In one-bit encoding the signal was either not preemphasized or
preemphasized using a fixed filter. This filter had its zero at
50 Hz (i.e. b = e^{-100 \pi T}, T being the sampling period). With 2
bits, one is able to specify 4 preemphasis filters. By
examining the quality of the synthesized speech, we concluded
that one-bit adaptive preemphasis was adequate. However, for a
real time system it might be sufficient and more practical to
use simple fixed preemphasis.

2. Bandwidth Expansion Method

This method reduces the spectral dynamic range by
multiplying the impulse response of the inverse filter A(z)
(i.e. the predictor coefficients) by a decaying exponential*.
The new predictor coefficients are given by

    a'_n = a_n e^{-\alpha n}, \quad \alpha > 0, \quad 1 \le n \le p.    (16)

The result of this is to shift the poles of the linear predictor
inwards with respect to the unit circle in the z plane, thus
widening their bandwidths [20].
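
A minimal sketch of the weighting in (16) is given below. The
function names are hypothetical. The helper that converts the
decay constant into a bandwidth increase uses the standard
narrow-band relation between pole radius r and bandwidth,
B = -ln(r)/(pi T), which is not taken from this report and should
be read as an assumption.

```python
import math

def bandwidth_expand(a, alpha):
    """Apply (16): a'_n = a_n * exp(-alpha * n) for a = [a_1, ..., a_p]."""
    return [an * math.exp(-alpha * n) for n, an in enumerate(a, start=1)]

def bandwidth_increase_hz(alpha, sampling_rate_hz):
    """Approximate widening of every pole bandwidth, in Hz.

    Scaling each pole radius by exp(-alpha) adds alpha / (pi * T) Hz under
    the assumed relation B = -ln(r) / (pi * T), with T = 1 / sampling rate.
    """
    return alpha * sampling_rate_hz / math.pi
```

For a second-order predictor with complex poles the pole radius
is sqrt(a_2), so bandwidth_expand([-1.6, 0.96], 0.05) moves the
radius from sqrt(0.96) to sqrt(0.96)*exp(-0.05).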

Preprocessing by either of these methods can be done after
the linear prediction analysis, so that it can be viewed as part
of the encoding process. In our experience we have found
preemphasis to be a more effective preprocessing method than the
bandwidth expansion method.


*If, however, an appropriate growing exponential is used, many
of the pole bandwidths decrease thus enhancing the formant peaks
in the spectrum and facilitating better formant tracking [2,7].

B. Quantization of Pitch and Gain

We quantize both pitch and gain logarithmically [21]. We
use 6 bits to quantize pitch, with the quantization level of 0
indicating an unvoiced frame. The range of pitch frequency is
taken to be 50-450 Hz. As gain parameter, we quantize the mean
square value of the speech signal using 5 bits. We assume a
range of 45 decibels.
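
The pitch coding described above might be sketched as follows.
Only the 6-bit word, the use of level 0 for unvoiced frames, and
the 50-450 Hz range come from the text. The placement of the 63
voiced levels (equally spaced in log frequency, end points
included), the clipping of out-of-range values, and the function
names are assumptions made for illustration.

```python
import math

F_MIN_HZ, F_MAX_HZ = 50.0, 450.0
VOICED_LEVELS = 63            # codes 1..63; code 0 marks an unvoiced frame

def encode_pitch(f_hz):
    """Map a pitch frequency (or None for unvoiced) to a 6-bit code."""
    if f_hz is None:
        return 0
    f = min(max(f_hz, F_MIN_HZ), F_MAX_HZ)             # clip to the range
    x = math.log(f / F_MIN_HZ) / math.log(F_MAX_HZ / F_MIN_HZ)   # 0..1
    return 1 + round(x * (VOICED_LEVELS - 1))

def decode_pitch(code):
    """Inverse mapping; returns None for the unvoiced code."""
    if code == 0:
        return None
    x = (code - 1) / (VOICED_LEVELS - 1)
    return F_MIN_HZ * (F_MAX_HZ / F_MIN_HZ) ** x
```

With this spacing adjacent levels differ by a constant ratio of
about 3.6 percent, so the relative error in decoded pitch stays
below 2 percent anywhere in the range. The gain could be handled
the same way, with 32 levels spread uniformly over 45 dB.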


C.  Choice  of  Filter  Parameters 

For  use  as  transmission  parameters,  we  chose  to  investigate 
the  following  sets  of  parameters  which  uniquely  characterize  the 
linear  prediction  filter  H(z): 

(1) Impulse response of the inverse filter A(z), i.e.
predictor coefficients a_n, 1 ≤ n ≤ p.

(2) Impulse response of the all-pole model H(z), h_n, 0 ≤ n ≤ p,
which are easily obtained from (1) by long division. Note
that the first p+1 coefficients uniquely specify the
filter.

(3) Autocorrelation coefficients of {a_n/G},

    b_i = \frac{1}{G^2} \sum_{j=0}^{p-|i|} a_j a_{j+|i|}, \quad a_0 = 1, \quad 0 \le i \le p.    (17)

(4) Autocorrelation coefficients of {h_n},

    r_i = \sum_{j=0}^{\infty} h_j h_{j+|i|}, \quad 0 \le i \le p.    (18)

It can be shown that r_i is equal to R_i in (8) for 0 ≤ i ≤ p
[2,7].

(5) Spectral coefficients of A(z)/G, P_i, 0 ≤ i ≤ p, (or
equivalently spectral coefficients of H(z), 1/P_i)

    P_i = b_0 + 2 \sum_{j=1}^{p} b_j \cos\frac{2\pi i j}{2p+1}, \quad 0 \le i \le p,    (19)

where b_i are as defined in (17). In words, {P_i} is
obtained from {b_i} through a discrete Fourier transform
(DFT). Traditionally, vocoders that transmit the spectrum
at selected frequencies have been known as channel
vocoders. Thus, use of the spectral coefficients as
transmission parameters leads to a linear prediction
channel vocoder. While in the classical channel vocoder
different channel signals are derived from contiguous
band-pass filters, in the linear prediction channel
vocoder a selected set of p+1 points from the all-pole
spectrum constitute the "channel outputs." The main
advantage of the linear prediction channel vocoder,
however, is that we are able to regenerate exactly the
all-pole spectrum from a knowledge of the p+1 spectral
coefficients, unlike in the classical channel vocoder.

(6) Cepstral coefficients of A(z), c_n, 1 ≤ n ≤ p, (or equivalently
cepstral coefficients of H(z)/G, -c_n)

    c_n = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log A(e^{j\omega}) \, e^{jn\omega} \, d\omega .

Since A(z) is minimum phase, we obtain using the results
given in [22, p. 246]

    c_n = a_n - \sum_{m=1}^{n-1} \frac{m}{n} c_m a_{n-m}, \quad 1 \le n \le p.    (20)

(7) Poles of H(z) (or equivalently zeros of A(z)).

(8) Reflection coefficients k_i, 1 ≤ i ≤ p, or simple
transformations thereof, e.g. area coefficients [1,10].
The area coefficients are given by

    A_i = A_{i+1} \frac{1+k_i}{1-k_i}, \quad A_{p+1} = 1, \quad 1 \le i \le p.    (21)
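
Several of the parameter sets listed above can be computed
directly from the predictor or reflection coefficients. The
sketch below is illustrative only: the function names are
hypothetical, the DFT in (19) is assumed to be taken over 2p+1
points, and the recursion in (20) is assumed to start at n=1
with an empty sum (so that c_1 = a_1).

```python
import math

def inverse_filter_autocorr(a, gain=1.0):
    """b_i of (17) for a = [a_1, ..., a_p]; a_0 = 1 is prepended here."""
    c = [1.0] + list(a)
    p = len(a)
    return [sum(c[j] * c[j + i] for j in range(p - i + 1)) / gain ** 2
            for i in range(p + 1)]

def spectral_coeffs(b):
    """P_i of (19): assumed (2p+1)-point DFT of the symmetric sequence b."""
    p = len(b) - 1
    return [b[0] + 2.0 * sum(b[j] * math.cos(2.0 * math.pi * i * j / (2 * p + 1))
                             for j in range(1, p + 1))
            for i in range(p + 1)]

def cepstral_coeffs(a):
    """c_n of (20): c_n = a_n - sum_{m=1}^{n-1} (m/n) c_m a_{n-m}."""
    c = []
    for n in range(1, len(a) + 1):
        acc = sum(m * c[m - 1] * a[n - m - 1] for m in range(1, n))
        c.append(a[n - 1] - acc / n)
    return c

def area_coeffs(k):
    """[A_1, ..., A_{p+1}] of (21), built backwards from A_{p+1} = 1."""
    areas = [1.0]
    for ki in reversed(k):
        areas.insert(0, areas[0] * (1.0 + ki) / (1.0 - ki))
    return areas
```

As a check, for A(z) = 1 + 0.5 z^{-1} with G=1 the first spectral
coefficient P_0 equals |A(1)|^2 = 2.25, as expected for the power
spectrum of A(z) at zero frequency.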


Some  of  the  above  sets  of  parameters  have  p+1  coefficients 
while  others  have  only  p coefficients.  However,  for  the  latter 
sets  the  signal  energy  (or  gain  G)  needs  to  be  transmitted  as 
well,  thus  keeping  the  total  number  of  parameters  as  p+1  for  all 
the  cases.  Although  the  above  sets  provide  equivalent 
information  about  the  linear  predictor,  their  properties  under 
quantization  are  different.  Certain  aspects  of  the  sets  (1), 
(4), (7) and (8) have been studied in the past [1,9]. Our
purpose was to investigate the relative quantization properties
of all these parameters with a particular emphasis on the
reflection coefficients.

It should be emphasized that the predictor coefficients can
be recovered from any of the various sets of parameters listed
above. The required transformations for such a recovery are
given below only for the sets (3) and (5) since they are either
well-known or obvious for the others [23].

The sequence {b_i} is transformed through an FFT after
appending it with an appropriate number of zeros to achieve
sufficient resolution in the resulting spectrum of the filter
A(z)/G. The spectrum of the all-pole filter H(z) is then
obtained by simply inverting the amplitudes of the computed
spectrum. Inverse Fourier transformation of the spectrum of
H(z) yields autocorrelation coefficients {r_i} defined in (18).
The first p+1 autocorrelation coefficients r_i, 0 ≤ i ≤ p, are then
used to compute the predictor coefficients via the normal
equations (7) with R_i = r_i, 0 ≤ i ≤ p.

The predictor coefficients are recovered from the spectral
coefficients {P_i} by first taking the inverse DFT of the
sequence {P_i} to get the autocorrelation sequence {b_i}. The
process of getting the predictor coefficients from {b_i} has been
discussed above.
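
The recovery of the predictor coefficients from {b_i} can be
sketched as below. Two simplifications are assumed. First,
direct cosine sums on an n_fft-point grid stand in for the FFT
and inverse FFT, which is equivalent for a real symmetric
sequence but slower. Second, the normal equations (7), which
are not reproduced in this section, are assumed to have the form
sum_k a_k R_|i-k| = -R_i, 1 ≤ i ≤ p, and are solved with the
usual Levinson recursion. All function names are hypothetical.

```python
import math

def spectrum_from_b(b, n_fft):
    """Spectrum of A(z)/G on n_fft uniformly spaced frequencies."""
    p = len(b) - 1
    return [b[0] + 2.0 * sum(b[j] * math.cos(2.0 * math.pi * m * j / n_fft)
                             for j in range(1, p + 1))
            for m in range(n_fft)]

def autocorr_of_h(b, n_fft=512):
    """Invert the spectrum of A(z)/G and transform back to get r_0..r_p."""
    p = len(b) - 1
    inv = [1.0 / s for s in spectrum_from_b(b, n_fft)]
    return [sum(inv[m] * math.cos(2.0 * math.pi * m * i / n_fft)
                for m in range(n_fft)) / n_fft
            for i in range(p + 1)]

def levinson(r):
    """Solve the assumed normal equations for a_1..a_p given r_0..r_p."""
    a, err = [], r[0]
    for i in range(1, len(r)):
        acc = r[i] + sum(a[j - 1] * r[i - j] for j in range(1, i))
        k = -acc / err
        a = [a[j - 1] + k * a[i - j - 1] for j in range(1, i)] + [k]
        err *= 1.0 - k * k
    return a

def predictor_from_b(b, n_fft=512):
    return levinson(autocorr_of_h(b, n_fft))
```

For A(z) = 1 - 1.2 z^{-1} + 0.5 z^{-2} with G=1, (17) gives
{b_i} = {2.69, -1.8, 0.5}, and predictor_from_b returns values
very close to [-1.2, 0.5]. The grid must be fine enough that the
autocorrelation of H(z) has decayed within n_fft samples;
otherwise time aliasing corrupts r_i.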

For  the  purpose  of  quantization,  two  desirable  properties 
for  a parameter  set  to  have  are:  (a)  filter  stability  upon 
quantization  and  (b)  a natural  ordering  of  the  parameters. 
Property  (a)  means  that  the  poles  of  H(z)  continue  to  be  inside 
the unit circle even after parameter quantization. By (b) we
mean that the parameters exhibit an inherent ordering, e.g. the
predictor coefficients are ordered as a_1, a_2, ..., a_p. If a_1 and a_2
are interchanged then H(z) is no longer the same in general,
thus illustrating the existence of an ordering. The poles of
H(z), on the other hand, are not naturally ordered since
interchanging the order of any two poles does not change the
filter. When such an ordering is present, a statistical study
on the distribution of individual parameters can be used to
develop better encoding schemes (e.g. Huffman coding, see
Section VIII). Only the poles and the reflection coefficients
ensure stability upon quantization, while all the sets of
parameters except the poles possess a natural ordering. Thus,
only the reflection coefficients possess both of these
properties.
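
Property (a) can be checked numerically. The sketch below uses
the standard step-down recursion to convert predictor
coefficients into reflection coefficients, together with the
standard result that H(z) is stable if and only if every
|k_i| < 1. Neither the recursion nor the function names are
taken from this section, so both should be read as assumptions.

```python
def reflection_from_predictor(a):
    """Step-down recursion: [a_1..a_p] -> [k_1..k_p].

    Returns None as soon as some |k_i| >= 1, i.e. H(z) is not stable.
    """
    a = list(a)
    k = []
    for i in range(len(a), 0, -1):
        ki = a[i - 1]                       # k_i is the last coefficient
        if abs(ki) >= 1.0:
            return None
        k.insert(0, ki)
        # Coefficients of the order (i-1) predictor.
        a = [(a[j] - ki * a[i - 2 - j]) / (1.0 - ki * ki) for j in range(i - 1)]
    return k

def is_stable(a):
    return reflection_from_predictor(a) is not None

def quantize(x, step):
    """Uniform quantizer used here only to illustrate coarse rounding."""
    return step * round(x / step)
```

For example, the predictor [-1.6, 0.96] is stable, but rounding
its coefficients to the nearest 0.1 gives [-1.6, 1.0], which
places the poles on the unit circle. With reflection
coefficients, stability is preserved simply by keeping every
quantized level strictly inside (-1,1).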

We  investigated  experimentally  the  quantization  properties 
of  the  sets  of  parameters  discussed  above,  with  and  without 
preprocessing  of  the  speech  signal.  The  absolute  error  between 
the  log  power  spectra  of  the  unquantized  and  the  quantized 
linear  predictors  was  used  as  a criterion  in  this  study,  since 
we  believe  that  a good  spectral  match  is  necessary  for 
synthesizing speech with good quality. A summary of the results
is  provided  in  the  following. 

The impulse responses {a_n} and {h_n} are highly susceptible
to causing instability of the filter upon quantization. This is
well-known from discrete filter analysis. Positive definiteness
of autocorrelation coefficients {b_i} and {r_i} is not ensured
under quantization, which also leads to instabilities in the
linear prediction filter. An attempt to synthesize speech with
quantized autocorrelation coefficients {r_i} resulted in
distinctly perceivable "clicks" in the synthesized speech. Our
conclusion is that the impulse responses and autocorrelation
coefficients can be used only under minimal quantization, in
which case the transmission rate would be excessive.

In  the  experimental  investigation  of  the  spectral  and 
cepstral  parameters,  we  found  that  the  quantization  properties 
of  these  parameters  are  generally  superior  to  those  of  the 
impulse  responses  and  autocorrelation  coefficients.  The 
spectral  parameters  often  yield  results  comparable  to  those 
obtained  by  quantizing  the  reflection  coefficients.  However, 
for the cases when the spectrum consists of one or more very
sharp peaks (narrow bandwidths), the effects of quantizing the
spectral coefficients often result in the autocorrelation
coefficients {b_i} being non-positive definite and hence cause
certain regions in the reconstructed spectrum to become
negative. This in turn causes the autocorrelation coefficients
{r_i} to be non-positive definite, which leads to instability of
the filter. Preprocessing the speech signal by the bandwidth
expansion method (see Section A) remedies this situation, but
the spectral deviation in these regions can be relatively large.
Quantization of cepstral parameters can also lead to
instabilities, where the predictor coefficients are computed
from (20). As before, with proper preprocessing stability is
restored, but at the expense of increased spectral deviation.

As  mentioned  earlier,  the  stability  of  the  filter  H(z)  is 
guaranteed  under  quantization  of  the  poles.  This  makes  the 
poles  potentially  a good  set  of  parameters  for  transmission. 
Unfortunately,  the  poles  do  not  possess  a natural  ordering:  a 
property  that  is  necessary  if  a low  transmission  rate  is 
desired. Traditionally, poles have been ordered in terms of
vocal tract resonances (formants). Since the ranges of
frequencies for the various formants have been well established,
their quantization can be done with improved accuracy. In
addition, the formant bandwidths may be quantized less
accurately than formant frequencies, which leads to further
savings in transmission rate. However, experience has shown
that the problem of identifying the poles as ordered formants is
computationally complex and involves a fair amount of decision
making which is not completely reliable. In addition, computing
the poles requires finding the roots of a pth order polynomial
(p≈12): not a straightforward task.

Based  on  the  results  of  our  experimental  study  of  the 
spectral  deviation  due  to  quantization,  on  computational 
considerations, and on stability and natural ordering
properties, we concluded that the reflection coefficients are
the best set for use as transmission parameters. In addition to
these advantages, the values of the reflection coefficients k_i,
i<p, do not change as p is varied, unlike any of the other
parameters. (This property of the reflection coefficients is
useful when applying the variable order linear predictive
analysis discussed in Section IV.)

VI. OPTIMAL QUANTIZATION OF REFLECTION COEFFICIENTS

Having  selected  the  reflection  coefficients,  we  proceeded 
to  develop  an  optimal  quantization  scheme  which  gives  the  best 
results  in  terms  of  the  quality  of  the  synthesized  speech. 
First,  we  established  a suitable  criterion  with  respect  to  which 
we  developed  an  optimal  quantization  scheme.  It  is  known  that 
an  utterance  that  has  been  synthesized  perfectly  but  for  one  or 
two "glitches" (segments involving large errors of some sort)
would invariably be rated by a human subject as having a
relatively poor quality. In other words, these glitches mask
the perception giving an impression that the utterance has been
poorly synthesized. Thus, the quality of the synthesized speech
is a function of the "maximum perceptual error" between the
synthesized  and  the  original  speech.  Therefore,  a reasonable 
criterion  is  to  minimize  the  maximum  perceptual  error.  We 
assumed  that  an  accurate  representation  of  the  power  spectrum  is 
necessary  for  synthesizing  good  quality  speech.  Thus,  the 
criterion  we  used  for  optimal  quantization  was  to  minimize  the 
maximum  spectral  error  due  to  quantization. 

To  use  the  minimax  spectral  error  criterion  in  developing 
an  optimal  scheme  for  quantizing  the  reflection  coefficients,  it 
was  necessary  first  to  investigate  the  sensitivity  of  the 
all-pole  model  spectrum  to  small  changes  in  the  values  of  the 
reflection  coefficients.  Section  A below  describes  this 
sensitivity  analysis.  The  development  of  an  optimal 
quantization scheme using the sensitivity properties is given in
Section B. Section C presents an optimal bit allocation strategy
that we derived for the transmission parameters by minimizing
the maximum spectral error due to quantization. Finally, we
present in Section D the results of our investigation of a
second sensitivity measure for the reflection coefficients. A
detailed description of the material given in this section can
be found in [23].


A. Sensitivity Analysis

We define the spectral sensitivity for the reflection
coefficients by [23]

    \frac{\partial S}{\partial k_i} = \lim_{\Delta k_i \to 0} \left| \frac{1}{\Delta k_i} \left[ \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| \log \frac{P(k_i, \omega)}{P(k_i + \Delta k_i, \omega)} \right| d\omega \right] \right| ,    (22)

where

    P(\cdot, \omega) = \frac{1}{|A(e^{j\omega})|^2}

is the spectrum of the all-pole model H(z). The quantity between
brackets in (22) is the spectral deviation due to a perturbation
Δk_i in the ith reflection coefficient. Experimentally, ∂S/∂k_i was
computed by replacing the integral by a summation, and by using
a sufficiently small value for Δk_i. A sensitivity curve
∂S/∂k_i versus k_i was obtained by plotting sensitivity values for
k_i in the interval (-1,1) while keeping the other reflection
coefficients fixed. We performed this type of sensitivity
analysis for a large number of speech sounds recorded from male
and female speakers. The sensitivity curves have the following
properties in common:

(i) Each sensitivity curve versus k_i has the same general
shape, irrespective of the index i and irrespective of the
values of the other coefficients k_n, n≠i, at which the
sensitivity is computed.

(ii) Each sensitivity curve is U-shaped. It is even-symmetric
about k_i=0, and has large values when the magnitude of k_i
is close to 1 and small values when the magnitude of k_i is
close to zero.
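
The experimental procedure described above (a summation in place
of the integral and a small perturbation of one coefficient) can
be sketched as follows. The sketch assumes the usual step-up
recursion for converting reflection coefficients into predictor
coefficients, natural logarithms, and a frequency grid covering
only (0, pi), which suffices because the spectrum is symmetric.
The function names, the perturbation size dk and the grid size
n_freq are hypothetical choices.

```python
import cmath
import math

def predictor_from_reflection(k):
    """Step-up recursion (assumed convention): [k_1..k_p] -> [a_1..a_p]."""
    a = []
    for i, ki in enumerate(k, start=1):
        a = [a[j] + ki * a[i - 2 - j] for j in range(i - 1)] + [ki]
    return a

def log_spectrum(k, n_freq=256):
    """ln P(omega) = -ln |A(e^{j omega})|^2 on a midpoint grid over (0, pi)."""
    a = [1.0] + predictor_from_reflection(k)
    out = []
    for m in range(n_freq):
        w = math.pi * (m + 0.5) / n_freq
        resp = sum(an * cmath.exp(-1j * w * n) for n, an in enumerate(a))
        out.append(-math.log(abs(resp) ** 2))
    return out

def sensitivity(k, i, dk=1e-4, n_freq=256):
    """Finite-difference estimate of the sensitivity to k[i] (0-based index)."""
    perturbed = list(k)
    perturbed[i] += dk
    s0 = log_spectrum(k, n_freq)
    s1 = log_spectrum(perturbed, n_freq)
    # Mean absolute log-spectral difference, divided by the perturbation.
    return sum(abs(x - y) for x, y in zip(s0, s1)) / (n_freq * dk)
```

Sweeping one k[i] across (-1,1) with the other coefficients held
fixed, and plotting the returned values, gives a sensitivity
curve of the kind discussed above. How steeply the curve rises
near 1 depends on the other coefficients.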


It  must  be  emphasized  that  property  (i)  refers  only  to  the 
shape  of  the  sensitivity  curve.  The  actual  value  of  the 
sensitivity  for  a particular  reflection  coefficient  does  in 
general depend on the values of the other reflection
coefficients.

Although  the  above  sensitivity  properties  were  derived 
experimentally  by  perturbing,  one  at  a time,  the  magnitudes  of 
the  reflection  coefficients  that  corresponded  to  different 
speech  sounds,  these  properties  should  be  viewed  as  inherent  to 
the  reflection  coefficients  themselves  and  not  to  the  particular 
speech  sounds.  Thus,  voiced  sounds  generally  have  a higher 
spectral  sensitivity  than  unvoiced  sounds  because  some  of  the 
reflection  coefficients  for  voiced  sounds  have  magnitudes  close 
to  1.  Also,  in  general,  preemphasis  reduces  the  spectral 
sensitivity  of  voiced  sounds  by  reducing  the  magnitudes  of  the 
reflection coefficients which are close to 1.

The sensitivity properties given above strongly suggested
the existence of a prototype sensitivity function which would
apply approximately to every reflection coefficient and for
different speech sounds. Such a prototype function could then
be used in developing an optimal quantization scheme that would
apply to all reflection coefficients all the time. In view of
the above sensitivity properties, we computed this prototype
sensitivity function by simply averaging the sensitivity curves
over different reflection coefficients and for a large number of
different speech sounds. Such an averaged sensitivity function
is shown plotted as the solid curve in Fig. 11. In this plot the
sensitivity values are given in decibels relative to the
sensitivity at k=0. In the following, we present an optimal
quantization scheme for the reflection coefficients which we
developed using the averaged sensitivity function in Fig. 11.


B. Optimal Quantization

From the sensitivity properties of the reflection
coefficients discussed in the previous section and depicted in
Fig. 11, it is clear that linear quantization of the reflection
coefficients is not satisfactory, especially when some of them
take values close to 1 in magnitude. What is needed is a
nonlinear quantization scheme that is much more sensitive (has
more steps) near 1 than near 0. A nonlinear quantization of a
reflection coefficient is equivalent to a linear quantization of
a different parameter that is related to the reflection
coefficient by a nonlinear transformation. It is not difficult
to show that linear quantization of the transformed parameter is
optimal (in the sense of minimizing the maximum spectral error
due to quantization) if and only if the transformed parameter
has a flat or constant spectral sensitivity behavior. The
sufficiency of the condition is evident from the fact that with
a flat sensitivity behavior and linear quantization, the maximum
spectral error is constant over the entire range of variation of
the parameter, which trivially leads to a minimum equal to that
constant value. The necessity of the condition can be
established by using the proof by contradiction method as
follows. If the transformed parameter does not have a flat
sensitivity behavior, then a suitable nonlinear quantization
leading to a smaller maximum spectral error can be found by
assigning smaller quantization steps in regions where the
parameter has high sensitivity and vice versa. This is clearly
a contradiction to the fact that linear quantization is optimal.
Thus, the search for the optimal quantization scheme for the
reflection coefficients reduces to the search for a nonlinear
transformation that results in a flat spectral sensitivity
behavior for the transformed parameters.

Fig. 11. Averaged spectral sensitivity curve for the
reflection coefficients (solid line) and an
analytical function that approximates it
(dashed line).

If the transformed parameter is denoted by g=f(k), we have
shown in [23] that the optimal nonlinear transformation f(k) is
given by

    \frac{df(k)}{dk} = L \, \frac{\partial S}{\partial k} ,    (23)

where L is some constant. To derive the mapping that is optimal
on the average for all the reflection coefficients, we used the
averaged sensitivity function in Fig. 11. Although it is
possible to obtain the optimal transformation by integrating the
solid curve in Fig. 11 directly, we found it simpler and
ultimately more useful to approximate the averaged sensitivity
curve by a well specified mathematical function which could then
be integrated to obtain an approximately optimal f(k). An
experimental fitting of the averaged sensitivity curve in
Fig. 11 has revealed that the function 1/(1-k^2) approximates the
sensitivity function reasonably well (to within a multiplicative
constant), as shown by the dashed curve in Fig. 11 (note that
the plot is given in decibels). Using this approximation in (23)
and integrating with L=2, we get the optimal mapping as

    f(k) = \log \frac{1+k}{1-k} .    (24)

The optimally transformed parameters are therefore given by

    g_i = \log \frac{1+k_i}{1-k_i}, \quad 1 \le i \le p.    (25)

Using (21), the transformation in (24) is simply the logarithm
of the ratio of the consecutive area coefficients. Thus, we
have shown that the logarithms of the area ratios (henceforth
called log area ratios) provide an approximately optimal set of
coefficients for quantization.

Figure  12  shows  a plot  of  the  log  area  ratio  as  a function 
of  the  reflection  coefficient.  We  have  also  plotted  in  Fig.  12 
a linear  characteristic  that  passes  through  the  intersection  of 
a vertical line at k=0.7 and the log area ratio curve. For
values of k less than 0.7 in magnitude, the log area ratio curve
is almost linear. Thus, if a certain reflection coefficient
takes values always less than 0.7 in magnitude, one could
quantize it linearly to obtain approximately flat sensitivity
characteristics. In practice it is found that the reflection
coefficients k_i, i>3, have in general magnitudes less than 0.7.
However, use of the log area ratios automatically leads to the
desired quantization irrespective of the reflection coefficient
and the range of values it spans.
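
The log area ratio mapping and its linear quantization can be
sketched as follows. The function names, the choice of
reconstruction at the cell midpoint, the clipping of values
outside the quantizer range, and the range of -4 to 4 with 16
levels used in the example are illustrative assumptions.

```python
import math

def log_area_ratio(k):
    """g = log((1 + k) / (1 - k)) for a reflection coefficient |k| < 1."""
    return math.log((1.0 + k) / (1.0 - k))

def reflection_from_lar(g):
    """Inverse mapping: k = (e^g - 1) / (e^g + 1) = tanh(g / 2)."""
    return math.tanh(0.5 * g)

def quantize_uniform(g, g_min, g_max, levels):
    """Linear quantization of g; returns (index, reconstructed value).

    Values outside [g_min, g_max] are clipped to the nearest end cell.
    """
    step = (g_max - g_min) / levels
    g = min(max(g, g_min), g_max)
    index = min(levels - 1, int((g - g_min) / step))
    return index, g_min + (index + 0.5) * step
```

With g quantized in equal steps of 0.5, the cell between g=0 and
g=0.5 spans about 0.245 in k, while the cell between g=3.5 and
g=4 spans only about 0.023. Equal steps in g therefore place
many more levels near |k|=1 than near k=0, as required by the
sensitivity behavior.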

We note from (10) and (20) that, for a stable filter, the
log area ratios take on values in the region $-\infty < g_i < \infty$, for all i.
The filter becomes unstable if any of the log area ratios
becomes unbounded. The potential unboundedness of the log area
ratios means that the range over which they need to be quantized
can be very large, which can lead to an excessive number of
quantization bits or else to very coarse quantization. However,
in  practice,  the  range  is  often  limited  by  the  types  of  signals 
that  are  processed.  For  example,  we  have  not  found  the  range  to 
be  very  large  for  speech  signals,  especially  when  preemphasis  is 
used.  The  problem  could  still  arise,  as  a result  of 
computations with a small wordlength. In that case, the range
could be limited artificially. This is good practice because
otherwise very narrow bandwidth filters would result, which in
general is undesirable in speech synthesis.

C. Optimal Bit Allocation

For  the  log  area  ratios,  we  have  derived  an  optimal  bit 
allocation  strategy  by  minimizing  the  maximum  spectral  deviation 
due to quantization [23]. If $g_i$ is the ith log area ratio with
its lower and upper bounds $(g_i)_{\min}$ and $(g_i)_{\max}$, respectively,
and $N_i$ is the number of levels used for its quantization, the
step size for $g_i$ is given for linear quantization by

$$\delta_i = \frac{(g_i)_{\max} - (g_i)_{\min}}{N_i}. \tag{26}$$

The optimal bit allocation is obtained if $\delta_i$ is the same for all
the log area ratios. This result is also intuitively clear
since the spectral sensitivity is approximately constant and is
approximately the same for all the log area ratios. The total
number of bits required to quantize the p log area ratios is

$$M = \log_2 \left[ \prod_{i=1}^{p} N_i \right].$$

We found it convenient and useful to begin with a particular
quantization step size. That automatically determines the total
number of bits needed, as well as the maximum spectral
deviation, which in turn determines the resulting speech
quality. One can then study the change in speech quality as a
function of only one variable, namely the quantization step
size.


For  the  quantization  of  the  log  area  ratios  as  well  as  for 
determining  the  optimal  bit  allocation  strategy  discussed  above, 
we  need  the  knowledge  of  the  ranges  of  the  different  log  area 
ratios $g_i$, $1 \le i \le p$. For a set of 12 speech utterances that were
sampled  at  10  kHz  and  preemphasized  using  a fixed  filter,  we 
extracted,  at  a rate  of  100  frames/sec,  the  log  area  ratios 
through  the  linear  predictive  analysis  using  p=11  and  an 
analysis  interval  of  20  msec.  The  maximum  and  minimum  values 
were  found  for  each  log  area  ratio,  and  the  corresponding  range 
was  then  determined  by  allowing  some  margin  on  both  of  these 
values. In this study we used $10 \log_{10}$ instead of the natural
logarithm in computing log area ratios from (25). So, the
computed  log  area  ratios  were  in  "decibels".  In  collecting  the 
range  statistics  for  log  area  ratios,  we  treated  voiced  and 
unvoiced  sounds  separately.  In  our  experience  we  found  that 
using a step size of 1 dB for quantizing log area ratios
provides  a good  compromise  between  speech  quality  and 
transmission  rate.  In  Table  I we  have  given  the  bounds  of  log 
area  ratios  along  with  their  optimal  bit  (level)  allocations  for 
a step  size  of  1 dB  [21]. 
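
The equal-step-size allocation of (26) can be sketched in a few lines of Python. The bounds below are hypothetical placeholders and are NOT the values of Table I; the function names are likewise illustrative. The number of levels is rounded up so that the step never exceeds the requested size.

```python
import math

def allocate_levels(bounds_db, step_db=1.0):
    """Levels N_i = range_i / step (rounded up), giving every ratio the same step."""
    return [math.ceil((hi - lo) / step_db) for lo, hi in bounds_db]

def total_bits(levels):
    """M = log2 of the product of the N_i."""
    return sum(math.log2(n) for n in levels)

# Hypothetical (min, max) bounds in dB, for illustration only.
bounds = [(-10.0, 22.0), (-12.0, 4.0), (-6.0, 10.0), (-4.0, 4.0)]
levels = allocate_levels(bounds)   # one level count per log area ratio
bits = total_bits(levels)          # total bits for the four ratios
```

When a level count is not a power of 2, M is not an integer; a fixed-length binary code would then need the next integer number of bits for that parameter.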

D. Comments on Another Spectral Sensitivity Measure


In  Section  A we  introduced  a spectral  sensitivity  measure 
to  study  the  quantization  properties  of  the  reflection 
coefficients.  Other  types  of  sensitivity  measures  may  also  be 
used. In particular we have considered a measure which is
similar to the total-squared error used for minimization in
linear predictive analysis. By using Parseval's theorem in (3),
the total-squared error is given by

$$E = \frac{G^2}{2\pi} \int_{-\pi}^{\pi} \frac{P_0(\omega)}{P(\omega)} \, d\omega, \tag{27}$$

where $P_0(\omega)$ is the power spectrum of the input speech signal and
$P(\omega)$ is the power spectrum of the all-pole filter:

$$P(\omega) = |H(e^{j\omega})|^2 = \frac{G^2}{|A(e^{j\omega})|^2}. \tag{28}$$

The gain G is given by (13).


We have studied the properties of the error measure E in
detail [2,7,24]. In particular, the minimization of E results in
an all-pole model spectrum $P(\omega)$ that is a good approximation to
the envelope of the signal spectrum $P_0(\omega)$. Because of this
property, it seemed reasonable to study the use of this error E
as a measure of the deviation between the two spectra. For the
sake of normalization we have chosen to work with an error
measure E' obtained from (27) by eliminating the factor $G^2$:

$$E' = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{P_1(\omega)}{P_2(\omega)} \, d\omega, \tag{29}$$

where $P_1(\omega)$ and $P_2(\omega)$ are now any two spectra. Also, the two
spectra are normalized such that they have equal total energy.


For our study of spectral sensitivity we let $P_1(\omega) = P(k_i,\omega)$
and $P_2(\omega) = P(k_i+\Delta k_i,\omega)$, where $P(\cdot,\omega)$ is given by (28). The error
between the two spectra is then given by

$$E'(\Delta k_i) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{P(k_i,\omega)}{P(k_i+\Delta k_i,\omega)} \, d\omega. \tag{30}$$


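The new measure of spectral sensitivity is defined as

$$\frac{\partial S'}{\partial k_i} = \lim_{\Delta k_i \to 0} \left| \frac{1}{\Delta k_i} \log \left[ \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{P(k_i,\omega)}{P(k_i+\Delta k_i,\omega)} \, d\omega \right] \right|. \tag{31}$$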


We have derived the spectral sensitivity in (31) analytically,
without the need to resort to experimental data as was the case
for the study of $\partial S/\partial k_i$ in (22). The result can be shown to be
[23]

$$\frac{\partial S'}{\partial k_i} = \frac{2|k_i|}{1-k_i^2}. \tag{32}$$


It is important to note that this is an exact result and it is
true for each reflection coefficient, independent of the values
of the other coefficients. A plot of $\partial S'/\partial k$ versus k also gives a
U-shaped curve. Therefore, the spectral sensitivity in (32) has
the same general properties as the spectral sensitivity
obtained experimentally in Section A. The only difference
between the two is the actual shape of the sensitivity curve.
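
The analytical sensitivity can be checked numerically with a short Python sketch. It assumes the step-up recursion $A_i(z) = A_{i-1}(z) + k_i z^{-i} A_{i-1}(1/z)$, spectra normalized to unit total energy through the gain $\prod_i (1-k_i^2)$, and the natural logarithm; the function names, the FFT grid size, and the test coefficients are choices made for this example only.

```python
import numpy as np

def step_up(ks):
    """Predictor polynomial A(z) = 1 + sum a_j z^-j from reflection coefficients."""
    a = np.array([1.0])
    for k in ks:
        ext = np.concatenate([a, [0.0]])
        a = ext + k * ext[::-1]
    return a

def model_spectrum(ks, n_fft=8192):
    """All-pole power spectrum G^2 / |A|^2 with G^2 = prod(1 - k_i^2) (unit energy)."""
    gain2 = np.prod(1.0 - np.asarray(ks, dtype=float) ** 2)
    return gain2 / np.abs(np.fft.fft(step_up(ks), n_fft)) ** 2

def sensitivity_numeric(ks, i, dk=1e-5):
    """Finite-difference version of (31) for the coefficient at 0-based index i."""
    ks2 = np.array(ks, dtype=float)
    ks2[i] += dk
    ratio = model_spectrum(ks) / model_spectrum(ks2)
    # The mean over the frequency grid approximates (1 / 2 pi) times the integral.
    return abs(np.log(np.mean(ratio)) / dk)

ks = [-0.8, 0.5, -0.3, 0.2]   # illustrative reflection coefficients
numeric = [sensitivity_numeric(ks, i) for i in range(len(ks))]
closed_form = [2 * abs(k) / (1 - k * k) for k in ks]
```

Under these assumptions the finite-difference values agree with $2|k_i|/(1-k_i^2)$ for every coefficient, whatever the values of the others.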

Substituting (32) in the optimality condition (23) and
integrating it with L=1, we obtain the following optimal mapping
for the sensitivity measure (31):

$$f'(k) = \mathrm{sign}(k) \, \log \frac{1}{1-k^2}, \tag{33}$$


where sign(k) is +1 if k is positive and -1 if k is negative.
From (12) and (33), it is interesting to observe that $|f'(k_i)|$
is equal to the logarithm of the ratio of the normalized errors
(or log error ratio) associated with the linear predictors of
orders i-1 and i,

$$f'(k_i) = \mathrm{sign}(k_i) \, \log \frac{V_{i-1}}{V_i}. \tag{34}$$


We experimentally investigated the quantization properties
resulting from the mappings given by (24) and (33). Through
informal listening tests we found that the use of the log area
ratios for quantization leads to uniformly better speech quality
than that obtained using the log error ratios. This points out
the important fact that not all reasonable spectral sensitivity
measures lead to good results; the measure must somehow relate
to perception. Our conclusion is that the spectral sensitivity
measure in (22) relates more to perception than the measure in
(31) since it produces better results in terms of speech
quality.

VII. VARIABLE FRAME RATE TRANSMISSION

This section deals with time quantization, i.e. the rate
and  manner  in  which  parameters  are  transmitted  in  time.  For  a 
constant  frame  rate  scheme,  parameters  are  transmitted  at  fixed 
time  intervals.  A variable  frame  rate  scheme  transmits 
parameters  only  when  the  speech  characteristics  have 
sufficiently  changed.  Parameter  transmissions  occur  more 
frequently  when  speech  characteristics  are  changing  rapidly  as 
in  phoneme  transitions,  while  the  transmissions  are  spaced 
farther  apart  when  speech  characteristics  are  relatively 
constant  as  in  steady  state  sounds.  As  compared  to  a constant 
frame  rate  transmission  system,  the  variable  frame  rate 
transmission  system  could,  if  designed  properly,  yield  lower 
transmission rates for the same speech quality. We describe
below  a variable  frame  rate  scheme  that  we  use  in  our  speech 
compression  system. 

To determine if speech characteristics have sufficiently
changed since the last transmission, we use a measure that is
the logarithm of the ratio of the mean-squared values of the
error signal obtained (i) when the optimal linear predictor
parameters are used and (ii) when the last transmitted
parameters are used. If the predictor parameters are assumed to
have Gaussian probability distributions, then this measure is
the same as the log likelihood ratio [25]. To see how the
transmission scheme works, let us suppose that we have decided
to transmit the parameters for frame 1. Denote the predictor
coefficients of frame 1 (reference frame) by $a_k^{(1)}$, $1 \le k \le p$. For
frame 2 (test frame), the optimal linear predictor coefficients
$a_k^{(2)}$ are first determined. The speech signal for frame
2 is passed through the inverse filters $A_1(z)$ and $A_2(z)$ given by

$$A_1(z) = 1 + \sum_{k=1}^{p} a_k^{(1)} z^{-k}, \tag{35}$$

$$A_2(z) = 1 + \sum_{k=1}^{p} a_k^{(2)} z^{-k}. \tag{36}$$

The mean-squared values of the output signals of these inverse
filters are computed as

$$E^{(1)} = b_0^{(1)} R_0 + 2 \sum_{k=1}^{p} b_k^{(1)} R_k, \tag{37}$$

$$E^{(2)} = R_0 + \sum_{k=1}^{p} a_k^{(2)} R_k, \tag{38}$$

where $R_k$ are the autocorrelation coefficients of the speech
signal for frame 2 and $b_k^{(1)}$ are the autocorrelation coefficients
of the impulse response of $A_1(z)$, i.e.

$$b_k^{(1)} = \sum_{i=0}^{p-k} a_i^{(1)} a_{i+k}^{(1)}, \qquad a_0^{(1)} = 1, \quad 0 \le k \le p. \tag{39}$$


The deviation between the two sets of predictor coefficients
$\{a_k^{(1)}\}$ and $\{a_k^{(2)}\}$ is computed using the distance measure

$$d = 10 \log_{10} \left( E^{(1)} / E^{(2)} \right). \tag{40}$$


As mentioned earlier, the distance measure d in (40) becomes the
log likelihood ratio when the predictor coefficients have
Gaussian probability distributions. The next step is to compare
the distance d against a threshold. If d is within the
threshold (success), the data for frame 2 is not transmitted;
however, data transmission occurs if d exceeds the threshold
(failure). In the former case, the above procedure is repeated
for the successive test frames using frame 1 as the reference,
until a failure occurs or the number of consecutive successes
exceeds a preset limit. When one of these two conditions is
satisfied, the data for frame 1 is transmitted along with the
number of consecutive successes. At the receiver, we
interpolate between parameter receptions to ensure smoother
transitions in parameter updating.
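
The distance computation of (37)-(40) and the transmit decision can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: the autocorrelation method with a Levinson recursion stands in for the analysis, windowing and the coding of the interval count are omitted, and a frame is simply marked for transmission when it fails the test or when the preset gap is reached. All function and parameter names are hypothetical.

```python
import numpy as np

def autocorr(x, p):
    """Autocorrelation lags 0..p of a finite sequence."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)])

def levinson(R, p):
    """Optimal predictor a (a[0] = 1) and minimum error R_0 + sum a_k R_k, as in (38)."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = R[0]
    for i in range(1, p + 1):
        k = -np.dot(a[:i], R[i:0:-1]) / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= 1.0 - k * k
    return a, err

def distance_db(a_ref, R_test, err_test):
    """d of (40): error of the reference inverse filter over the optimal error, in dB."""
    p = len(a_ref) - 1
    b = autocorr(a_ref, p)                                     # (39)
    e1 = b[0] * R_test[0] + 2.0 * np.dot(b[1:], R_test[1:])    # (37)
    return 10.0 * np.log10(e1 / err_test)

def select_frames(frames, p=11, threshold_db=1.5, max_gap=8):
    """Indices of the frames whose parameters would be transmitted."""
    a_ref, _ = levinson(autocorr(frames[0], p), p)
    sent = [0]
    for t in range(1, len(frames)):
        R = autocorr(frames[t], p)
        a_t, err_t = levinson(R, p)
        failed = distance_db(a_ref, R, err_t) > threshold_db
        if failed or t - sent[-1] >= max_gap:
            sent.append(t)
            a_ref = a_t          # the transmitted frame becomes the new reference
    return sent
```

Since the optimal predictor minimizes the error for the test frame, d is never negative (up to rounding) and equals zero when the reference and optimal coefficients coincide.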

In our experiments, we used an analysis rate of 100
frames/sec (i.e., parameters were extracted once every 10 msec).
A satisfactory value of the threshold for the log likelihood
ratio measure was found experimentally as 1.5 dB. Parameter
transmissions were not allowed to be spaced by more than 80 msec
(8 frames). Variable frame rate transmission was used only for
log area ratios. Pitch and gain were transmitted at a constant
rate  of  50  times/sec.  With  these  specifications,  we 
experimented  with  14  sentences  of  speech  material  from  10 
speakers  (male  and  female).  Fig.  13  shows  the  relative 
frequencies  of  occurrence  of  the  different  transmission  interval 
sizes  10-80  msec  and  the  corresponding  percentage  bit  savings  in 
transmitting  log  area  ratios.  The  transmission  rate  for  log 
area  ratios  varied  between  24  and  45  frames/sec,  with  an  average 
of 37. Thus, we achieved a total saving of 63% in transmitting
log area ratios. In these experiments we found that the quality
of the synthesized speech dropped only slightly for the variable
frame rate transmission (37 frames/sec on the average) relative
to the constant frame rate transmission (100 frames/sec).
However,  when  compared  to  a constant  50  frames/sec  system,  the 
above  variable  frame  rate  scheme  produced  distinctly  better 
quality  speech. 

As  an  alternative  to  the  log  likelihood  ratio  measure 
described  above,  we  made  a preliminary  investigation  of  another 
measure  of  spectral  deviation  using  the  log  area  ratios.  This 
measure  is  simply  the  average  of  the  absolute  differences 
between  respective  log  area  ratios  of  the  frame  under  test  and 
the  previously  transmitted  data  frame.  In  another  study  (see 
Section XII-A) we found that the log area ratio error measure
has an approximately linear relationship with the spectral error
measure. This suggests that the log area ratio error measure
might have a good correlation with speech quality. However,
further testing is needed before any conclusive statement can be
made.
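
The alternative measure is simple enough to state in a few lines of Python. The function name and the sample values below are illustrative only.

```python
import numpy as np

def lar_distance(g_test, g_ref):
    """Average absolute difference between corresponding log area ratios."""
    g_test = np.asarray(g_test, dtype=float)
    g_ref = np.asarray(g_ref, dtype=float)
    return float(np.mean(np.abs(g_test - g_ref)))

# Hypothetical log area ratios of a test frame and of the last transmitted frame.
d = lar_distance([3.0, -1.0, 0.5], [2.0, -1.5, 0.5])
```

The result is in the same units as the log area ratios themselves, so a threshold on it would be stated in dB when the ratios are computed with $10 \log_{10}$.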

VIII. VARIABLE WORDLENGTH ENCODING

We  have  investigated  information  theory  approaches  to 
transmission  rate  reduction.  An  encoding  technique,  called 
Huffman  coding,  has  been  chosen  for  our  system,  which  makes  use 
of  the  statistical  distributions  of  the  quantized  values  of  the 
transmission  parameters.  Using  the  statistical  data  for  each 
parameter,  Huffman  coding  codes  the  values  that  are  most  likely 
to  be  transmitted  with  fewer  bits.  Thus  the  number  of  bits,  or 
wordlength,  required  to  code  a set  of  values  for  a parameter  is 
variable.  It  should  be  pointed  out  that  no  compromise 
whatsoever  in  speech  quality  is  made  when  employing  Huffman 
coding,  because  the  coding  does  not  result  in  any  information 
loss;  it  merely  transmits  the  information  more  efficiently. 

Another  encoding  method  we  have  used,  called  the  delta 
encoding  method,  codes  the  change  in  a parameter  from  frame  to 
frame. With delta encoding, the statistical distributions used
for Huffman coding of pitch and gain became sharpened, thus
making the Huffman coding of these parameters more effective.


A. Huffman Coding

Huffman code is the optimal unambiguous variable wordlength
code [26]. For each parameter it generates the lowest possible
average transmission rate. That is, $\sum_i P_i L_i$ is minimized, where
$P_i$ is the probability of the ith value of a parameter, and $L_i$ is
the number of bits required to code that value. Furthermore,
the particular method we have used also minimizes the maximum
code length $\max_i L_i$ and the total length of all codes $\sum_i L_i$ [27].
No two different parameter values result in the same code, and
given the beginning of a code sequence, no further information
is needed in order to know where a code begins or ends (i.e.
Huffman code is unambiguous).


In  order  to  find  the  Huffman  code  for  a particular  set  of 
values,  the  frequencies  of  occurrence  for  these  values  must  be 
known.  The  details  of  Huffman  coding  can  be  best  explained  by 
an  example.  Consider  a parameter  that  takes  on  7 possible 
values.  Given  below  are  the  7 values  and  the  number  of  times 
each  occurred: 


value    occurrences    probability
  0          600            3/7
  1          200            1/7
  2          200            1/7
  3          150            3/28
  4          100            1/14
  5          100            1/14
  6           50            1/28


To  find  the  code,  the  two  lowest  frequencies  are  found  and 
combined.  That  is,  the  frequencies  of  the  values  6 and  5 would 
have  a combined  frequency  of  150.  The  process  of  combining 
frequencies  is  continued  until  all  have  been  combined,  yielding 
a total  frequency  in  this  example  of  1400.  Fig.  14  shows  how 
these  different  frequencies  are  combined  producing  a tree 
structure.  The  boxed  numbers  in  Fig.  14  are  the  combined 
frequencies,  and  the  numbers  above  the  boxes  indicate  the  order 
in which the combinations are formed. The depth of a node is
the maximum path length from an initial constituent node to that
node. When two frequencies are equal, as are the uncombined
frequency corresponding to value 3 and the combined frequencies
of values 5 and 6, the frequency whose depth is the smaller is
considered the lower. That is, the depth of the combined
frequencies of 5 and 6 is 1, while the depth of the uncombined
frequency corresponding to value 3 is 0, so the
latter would be used in the next minimum pair.

Once the combinations have been completed, a tree has been
formed and can be retraced from the root node (1400) to find the
codes. Each arrow, or branch, will be assigned one bit, either
0 or 1, depending on whether it is the top or bottom branch into
the node. (This assignment is arbitrary and can be reversed if
desired.) The codes for the different values can be read from
the tree as: 0-0, 1-110, 2-100, 3-1010, 4-1011, 5-1110,
6-1111. The average length is therefore the sum of the code
lengths times the probabilities of their occurrence, or 2.43,
compared to the simple binary code which requires 3 bits.

The  minimum  average  length  of  the  Huffman  code  can  be 
approximated  by  the  entropy  of  the  parameter,  i.e. 

$$L_{\min} \approx -\sum_i P_i \log_2 P_i. \tag{41}$$

In the example given above, the entropy is 2.39.

Even if the probability distribution of the parameter
values was uniform for the above example, Huffman code would
require an average length of 2.86, which would still provide a
saving over simple binary encoding. Thus, the saving in the
average code length of a parameter offered by Huffman coding is
due in part to the number of parameter values being a
non-integer power of 2 and in part to the probability
distribution of the parameter values being non-uniform.
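
The seven-value example can be reproduced with a short Python sketch. It uses the standard-library heap and breaks ties between equal frequencies in favor of the node with the smaller depth, as described above; the function name is illustrative, and the order in which equal-frequency leaves are paired may differ from Fig. 14, although the resulting code lengths are the same.

```python
import heapq
import itertools
import math

def huffman_lengths(freqs):
    """Code length for each value; equal frequencies are ordered by smaller depth."""
    counter = itertools.count()          # unique tie-breaker, keeps tuples comparable
    heap = [(f, 0, next(counter), [v]) for v, f in freqs.items()]
    heapq.heapify(heap)
    lengths = {v: 0 for v in freqs}
    while len(heap) > 1:
        f1, d1, _, s1 = heapq.heappop(heap)
        f2, d2, _, s2 = heapq.heappop(heap)
        for v in s1 + s2:
            lengths[v] += 1              # one more branch on the path to the root
        heapq.heappush(heap, (f1 + f2, max(d1, d2) + 1, next(counter), s1 + s2))
    return lengths

freqs = {0: 600, 1: 200, 2: 200, 3: 150, 4: 100, 5: 100, 6: 50}
total = sum(freqs.values())
lengths = huffman_lengths(freqs)
average = sum(freqs[v] * lengths[v] for v in freqs) / total
entropy = -sum(f / total * math.log2(f / total) for f in freqs.values())

# Uniform case: seven equally likely values.
uniform = huffman_lengths({v: 1 for v in range(7)})
uniform_average = sum(uniform.values()) / 7
```

The sketch gives an average length of 3400/1400 bits for the example distribution and 20/7 bits for the uniform one, matching the figures quoted in the text to two decimals.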

Huffman  coding  offers  several  advantages  over  simple  binary 
coding  of  transmission  parameters. 

(a) Of primary importance, in the range of transmission rates
that we are interested in (2000 bps), Huffman coding
reduces the transmission rate by approximately 20%. In so
doing, it introduces no new approximations, as it codes
only information, not acoustic phenomena. This property
also allows it to be combined with other bit-saving
techniques such as variable order linear prediction or
variable frame rate transmission.

(b)  Huffman  coding  allows  any  number  of  quantization  levels. 
The  number  of  quantization  levels  for  a particular 
parameter  does  not  have  to  be  a power  of  2 to  produce 
efficient  code.  This  property  allows  the  number  of  levels 
to  be  chosen  according  to  other  criteria,  such  as  equal 
step  size  or  equal  spectral  sensitivity. 

(c) Huffman coding has been proven optimal [26]. It therefore
provides a useful standard against which to measure other
coding schemes.

Along  with  its  advantages,  Huffman  coding  also  presents 
some  disadvantages.  Since,  for  a given  parameter,  the  number  of 
bits  transmitted  is  not  constant,  the  algorithms  for  packing  and 
packetizing  for  ARPA  Network  applications  must  be  more  complex. 
Also,  a tree  search  is  required  for  decoding.  This  requires 
more  storage,  more  time,  and  a more  complex  algorithm  than  would 
a table-lookup  for  the  simple  binary  code.  It  may  be  possible 
to  combine  the  trees  for  a number  of  parameters,  thus  reducing 
the  storage  required.  Because  Huffman  coding  is  based  on  the 
statistical  likelihood  of  a particular  value  occurring,  good 
statistics  over  a fairly  large  data  base  must  be  found.  Huffman 
coding  is  most  useful  when  several  values  of  the  information  to 
be  coded  are  much  more  probable  than  the  other  values. 


B.  Delta  Encoding 

The  delta  encoding  scheme  codes  the  change  in  a parameter 
from  frame  to  frame.  We  found  this  to  be  useful  for  parameters, 
notably  pitch  and  gain,  which  change  slowly  but  require  a large 
number  of  quantization  levels.  In  our  experiments,  we  observed 
that the bit savings with the use of delta encoding by itself
were not very significant. However, when we combined delta
encoding with Huffman coding (by encoding the changes in
parameter values with Huffman code), we achieved significant
savings in bit rate for both pitch and gain. Furthermore, such
a combination  removes  some  of  the  speaker-dependent  aspects  of 
these  parameters.  For  example,  the  change  in  pitch  for  a female 
speaker  is  likely  to  be  nearer  that  of  a male  speaker  than  are 
the  actual  values  of  pitch.  Similarly,  delta  encoding  makes  the 
changes  in  gain  comparable  for  loud  and  soft  speech.  Delta 
encoding  thus  improves  the  statistics  for  Huffman  coding. 
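
The effect of delta encoding on the statistics can be illustrated with a small Python sketch. The quantized track below is hypothetical, and the empirical entropy is used only as a stand-in for the average Huffman code length it bounds from below; the function names are likewise illustrative.

```python
import math
from collections import Counter

def delta_encode(values):
    """First value as is, followed by the frame-to-frame changes."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """A running sum restores the original quantized values."""
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out

def entropy_bits(symbols):
    """Empirical entropy in bits per symbol."""
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in Counter(symbols).values())

# Hypothetical slowly varying quantized pitch track (level indices).
track = [30, 30, 31, 31, 32, 33, 33, 34, 34, 35, 35, 36, 36, 36, 37, 38]
deltas = delta_encode(track)
```

For this track the changes take only a few distinct values, so their histogram is much sharper than that of the levels themselves, which is the property that makes the subsequent Huffman coding effective.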


C. Statistics for Huffman Coding and Bit Savings

We  describe  below  the  data  base  we  used  to  generate 
statistics for Huffman coding. Separate statistics for log area
ratios,  pitch,  and  gain  are  briefly  described,  and  transmission 
rate  reductions  for  these  and  two  other  parameters  are  given 
(reference  [28]  gives  more  details). 

1. Data Base

We  used  8 sentences,  each  from  a different  speaker,  of  whom 
5 were  male  and  3 female,  as  a data  base.  Each  sentence  was 
sampled at 10 kHz and passed through a 50 Hz preemphasis filter.
Two types of analysis were performed, the first computing and
transmitting a new frame of parameters every 20 msec, including
pitch and gain (constant frame rate transmission), and the
second computing new parameters every 10 msec, but transmitting
only when the log likelihood ratio exceeded a threshold of 1.5
dB (variable frame rate transmission). Pitch and gain were
computed and transmitted every 20 msec as in the first type of
analysis. Eleven coefficients (fixed order: p=11) were used in
both types of analysis. The log area ratios were used for
transmission, and they were quantized as described in Section
VI-C, using a quantization step size of 1 dB. Pitch and gain
were both quantized logarithmically using 6 and 5 bits
respectively.  Histograms  were  then  compiled  for  each  log  area 
ratio  (one  each  for  voiced  and  unvoiced),  and  for  pitch  and 
gain. 


2.  Log  Area  Ratios 

For illustration, we have plotted the histogram for the log
area ratio $g_1$ for voiced sounds in Fig. 15 and for $g_8$ for
unvoiced sounds in Fig. 16. Other histograms have been
documented in [28]. Briefly, the histograms for $g_2$ and $g_3$ for
voiced sounds resemble the one in Fig. 15, the main difference
being that the skewness of the histogram is to the right for $g_2$
instead of to the left as for both $g_1$ and $g_3$. All other
histograms, namely, those for $g_4$-$g_{11}$ for voiced sounds and for
$g_1$-$g_{11}$ for unvoiced sounds are basically similar to the
histogram in Fig. 16. The histograms for variable frame rate
transmission were quite similar to those obtained for constant
frame rate transmission. In order to reduce the number of trees
in Huffman coding, we experimented with combining statistics for
several comparable log area ratios and representing them by one
tree [28].


Fig. 15. Histogram of the quantized log area ratio $g_1$ for voiced sounds.

3. Pitch

As pitch is, in general, a slowly varying parameter, we
investigated  coding  both  the  pitch  values  and  the  change  in 
pitch  values  from  frame  to  frame.  The  changes,  or  delta  values, 
are  much  more  useful.  The  number  of  occurrences  of  zero  change 
included  the  unvoiced  portions  of  speech,  but  also  included  a 
good  deal  of  the  voiced  portions.  The  average  transmission  rate 
for  simple  binary  encoding  of  pitch  was  300  bps  (50  frames/sec). 
With  Huffman  coding  of  the  actual  pitch  values,  this  dropped  to 
180 bps, and with Huffman coding of the delta values the rate
dropped  to  130  bps. 

We  have  also  experimented  with  another  method  of  encoding 
the  pitch.  This  method  codes  the  most  likely  value  with  one 
bit,  and  uses  7 bits  for  each  of  the  remaining  values. 
Transmission  rates  obtained  from  this  experiment  were  164  bps 
for  delta  values,  and  225  bps  for  actual  values.  The  advantage 
of  this  encoding  method  is  that  it  does  not  require  any  tree 
search  for  decoding. 
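
The average rate of this two-length code follows directly from the probability of the most likely value. The Python sketch below shows the arithmetic; the probability used is an illustrative placeholder rather than a measured statistic, and the function and parameter names are hypothetical.

```python
def escape_code_rate(p_most_likely, frames_per_sec=50, short_bits=1, long_bits=7):
    """Average bit rate when the most likely value costs 1 bit and all others 7."""
    bits = p_most_likely * short_bits + (1.0 - p_most_likely) * long_bits
    return frames_per_sec * bits

# 0.5 is an illustrative probability, not a value taken from the experiments.
rate = escape_code_rate(0.5)
```

The rate falls linearly from 350 bps to 50 bps as the probability of the most likely value rises from 0 to 1, which is why the method benefits from the sharply peaked distribution of the delta values.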

4. Gain

The  histogram  for  gain  was  found  to  be  quite  flat,  so 
Huffman  coding  offered  little  improvement  over  simple  binary 
coding.  However,  using  Huffman  coding  on  delta  values  of  gain, 
we  obtained  a saving  of  75  bps  over  the  binary  encoded  data  rate 
of 250 bps.

5. Coding Number of Poles

For variable order linear prediction using a maximum value
of p=13, simple binary encoding of the number of poles used for
analysis required 200 bps for 50 frames/sec transmission and 148
bps for variable frame rate transmission (basic analysis rate:
100 frames/sec). With Huffman coding (see the histogram in
Fig. 7), these rates dropped to 159 bps and 117 bps.

6.  Coding  Transmission  Interval 

For variable frame rate transmission, the time interval in
number of frames needs to be coded and transmitted. For the
maximum interval size of 8 frames, the simple binary encoding
required 111 bps. With Huffman coding (see the histogram in
Fig. 13) it dropped to 93 bps.

IX. SYNTHESIS

In the receiver structure in Fig. 1, the role of decoder is
straightforward, so we do not discuss it here. After decoding,
deemphasis (or postemphasis) is done on the decoded parameters
to undo the effect of preemphasis. For a fixed preemphasis,
deemphasis can be performed by passing the synthesized speech
through a fixed one-pole filter. For adaptive preemphasis,
deemphasis can properly be done only before synthesis. We
compute the inverse filter A'(z) (prime denoting the use of
decoded parameters) and multiply it by the preemphasis filter
$(1 - b'z^{-1})$ to obtain the inverse filter for the deemphasized
case. So deemphasis increases the order of the predictor by
one. The coefficients of the augmented predictor are used for
synthesis.
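
Multiplying the two polynomials amounts to convolving their coefficient sequences, as the following Python sketch shows. The function name and the numerical values are hypothetical; the predictor polynomial is assumed to be stored with its leading coefficient equal to 1.

```python
import numpy as np

def deemphasized_inverse_filter(a_decoded, b_decoded):
    """Coefficients of A'(z) (1 - b' z^-1); a_decoded[0] is assumed to be 1."""
    return np.convolve(np.asarray(a_decoded, dtype=float), [1.0, -b_decoded])

a_prime = [1.0, -0.9, 0.4]                         # hypothetical decoded predictor, order 2
aug = deemphasized_inverse_filter(a_prime, 0.5)    # augmented predictor, order 3
```

The augmented polynomial has one more coefficient than A'(z), which is the increase in predictor order noted above.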

The  remainder  of  this  section  deals  with  the  different 
aspects of the synthesizer. These are: excitation source,
implementation of the synthesizer, and interpolation and
resetting  of  the  synthesizer  parameters. 

A. Excitation

We  use  voiced/unvoiced  excitation.  Voiced  excitation 
consists  of  unit  pulses  separated  by  the  received  (or 
interpolated)  pitch  period.  Unvoiced  excitation  consists  of 
white  noise  samples  (zero  mean,  unit  variance,  and  uniformly 
distributed) produced at the sampling frequency using a random
number generator. Referring to Fig. 3b, the excitation signal
$u_n$ is multiplied by a suitable gain factor G. G is computed so
that the energy of the input signal $G u_n$ of the synthesizer is
equal to the energy of the linear prediction error signal. The
latter is given by $R_0' V_p'$, where primes indicate the use of the
decoded parameters (see equation (13)). Assuming that $R_0'$ is the
signal energy per sample, the gain factor $G_u$ for unvoiced
excitation is computed from $G_u^2 = R_0' V_p'$. The gain factor $G_v$ for
voiced excitation is computed from $G_v^2 = R_0' V_p' P$, where P is the
pitch period in samples.
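
The two excitation types and their gains can be sketched in Python as follows. The function names, the frame length, and the use of NumPy's random generator are assumptions made for this example; the uniform noise is scaled to the interval $[-\sqrt{3}, \sqrt{3}]$ so that it has unit variance.

```python
import numpy as np

def excitation_gain(r0, v_p, voiced, pitch_period=None):
    """G_u^2 = R_0 V_p for unvoiced frames, G_v^2 = R_0 V_p P for voiced frames."""
    energy = r0 * v_p
    return np.sqrt(energy * pitch_period) if voiced else np.sqrt(energy)

def excitation(n_samples, voiced, pitch_period=None, rng=None):
    """Unit pulses every pitch period, or zero-mean unit-variance uniform noise."""
    if voiced:
        u = np.zeros(n_samples)
        u[::pitch_period] = 1.0
        return u
    rng = np.random.default_rng() if rng is None else rng
    half_width = np.sqrt(3.0)    # uniform on [-sqrt(3), sqrt(3)] has variance 1
    return rng.uniform(-half_width, half_width, n_samples)

# Hypothetical decoded values: R_0' = 2.0, V_p' = 0.1, pitch period of 50 samples.
u = excitation(200, True, pitch_period=50)
g = excitation_gain(2.0, 0.1, True, pitch_period=50)
```

With one unit pulse every P samples, the factor P in the voiced gain makes the energy per sample of the scaled pulse train equal to $R_0' V_p'$, the same as for the scaled noise.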

With  the  excitation  model  described  above,  we  found  that 
voiced fricatives such as [z] sounded "buzzy" and unnatural when
synthesized  using  voiced  excitation.  Ideally  such  synthesis 
should  use  a proper  mixture  of  both  types  of  excitation. 
However,  we  obtained  satisfactory  results  by  synthesizing  voiced 
fricatives  as  merely  unvoiced  sounds.  To  make  this  happen 
automatically,  we  readjusted  the  threshold  for  zero  crossing 
rate  used  in  the  pitch  extraction  scheme  at  the  analysis  so  that 
an  unvoiced  decision  would  be  reached  for  analysis  frames 
containing  voiced  fricatives. 

Another  possible  improvement  that  we  studied  briefly  was  to 
modify the shape of the pulse excitation for voiced sounds. We
ran an experiment using Rosenberg's polynomial excitation [29]
to test its effect on the quality of the synthesis, but the
results were not conclusive.

B. Transfer Function

There  are  at  least  two  ways  in  which  to  implement  the 
transfer function of the synthesizer. The recursive filter or
canonical form implementation uses the predictor coefficients.
The second implementation applies Itakura's ladder structure [8]
that uses the reflection coefficients. It has been shown that
the  second  method  of  transfer  function  realization  results  in 
lower  sensitivity  to  errors  caused  by  finite  wordlength 
computations  [30].  Therefore,  it  should  be  used  in  real  time 
implementation  employing  integer  arithmetic.  In  our 
non-real-time  floating  point  simulation  experiments,  we  used 
only  the  recursive  filter  implementation.  For  such  a situation, 
the  two  methods  would  give  essentially  the  same  results. 
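
The equivalence of the two realizations in floating point can be illustrated with a short sketch. It assumes the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p for the inverse filter and the standard step-up recursion from reflection coefficients to predictor coefficients; the report's own conventions may differ in sign, and the function names are illustrative.

```python
def step_up(k):
    """Reflection coefficients k_1..k_p -> predictor polynomial [1, a_1, ..., a_p]."""
    a = [1.0]
    for m, km in enumerate(k, start=1):
        # a_j(m) = a_j(m-1) + k_m * a_(m-j)(m-1), and a_m(m) = k_m
        a = [1.0] + [a[j] + km * a[m - j] for j in range(1, m)] + [km]
    return a

def recursive_filter(a, x):
    """Direct (canonical) all-pole filter: y(n) = x(n) - sum_j a_j * y(n-j)."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for j in range(1, len(a)):
            if n - j >= 0:
                acc -= a[j] * y[n - j]
        y.append(acc)
    return y

def ladder_filter(k, x):
    """All-pole ladder (lattice) filter driven by the same reflection coefficients."""
    p = len(k)
    b = [0.0] * (p + 1)          # b[m] holds the backward signal b_m(n-1)
    y = []
    for xn in x:
        f = xn                   # forward signal f_p(n)
        new_b = [0.0] * (p + 1)
        for m in range(p, 0, -1):
            f = f - k[m - 1] * b[m - 1]           # f_(m-1)(n)
            new_b[m] = k[m - 1] * f + b[m - 1]    # b_m(n)
        new_b[0] = f
        b = new_b
        y.append(f)
    return y

k = [0.5, -0.3, 0.2]                   # illustrative values, all |k_i| < 1
impulse = [1.0] + [0.0] * 49
y_direct = recursive_filter(step_up(k), impulse)
y_ladder = ladder_filter(k, impulse)
```

In double precision the two impulse responses agree to within rounding error; the difference between the structures only becomes important with short wordlengths.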

C.  Parameter  Setting  and  Interpolation 

Decoded parameter values as supplied by the vector x'(t) in
Fig. 1 are used to update or reset the parameters of the
synthesizer. There are two types of parameter setting:
time-synchronous and time-asynchronous. A particular case of
the second is pitch-synchronous updating. Usually, the
parameters are reset at a higher rate than the rate of parameter
transmission. Thus, some interpolation must be performed
between the decoded parameter values. In time-synchronous
synthesis, parameters are interpolated and updated at some fixed
rate. In pitch-synchronous synthesis, parameter interpolation
and setting are done at every pitch pulse for voiced sounds and
time-synchronously for unvoiced sounds.

1. Time-Synchronous Versus Pitch-Synchronous Synthesis

We investigated both time-synchronous synthesis and
pitch-synchronous synthesis in our experiments. Since pitch and
filter parameters are not updated simultaneously in
time-synchronous synthesis, one might suspect that this would
introduce undesirable transients in the synthesized speech.
However, from the large number of experiments that we performed,
we found that the synthesized speech did not have any such
transients. As an example, Fig. 17 shows the waveform of
segments of speech synthesized time synchronously (vertical
lines mark the instances when parameter updating was done).
Further, our comparative study of time-synchronous synthesis and
pitch-synchronous synthesis showed that the quality of the
synthesized speech was actually better for time-synchronous
synthesis in some experiments, while in others it remained
essentially the same for both cases. In general, we found
that speech quality was best when the synthesizer parameters
were updated at a time corresponding to the time when they were
extracted in the analysis. Thus, if time-synchronous analysis
is used, time-synchronous synthesis should also be used.

Another reason why time-synchronous synthesis produced
better speech quality follows from the result reported in
Section XII-A that interpolation is a major source of error.
This is not to suggest that interpolation should never be done.
In fact, for the 50 frames/sec constant frame rate transmission
(2650 bps), use of interpolation (time-synchronous or
pitch-synchronous) certainly improved the speech quality. Our
finding cited above should rather be interpreted as a caution to
use interpolation only if needed. Returning to the two
synthesis approaches, if time-synchronous analysis is used, then
pitch-synchronous synthesis would require interpolation every
pitch period in general. For time-synchronous synthesis,
however, no interpolation is needed for those instances at which
analysis parameters have been extracted and transmitted. The
fact that less interpolation is performed in time-synchronous
synthesis perhaps explains the resulting improvement in speech
quality over pitch-synchronous synthesis.

Fig. 17. Four segments of speech synthesized
time synchronously. The vertical lines
(5 msec apart) mark the instances when
parameters of the synthesizer were updated.

It should be recalled that in our study we used the
recursive filter implementation of the synthesizer. We expect
the results to come out similar even when the ladder structure
is used.

2.  Interpolation  Study 

In our speech compression system, we interpolated both
pitch and energy logarithmically. For the synthesizer filter,
we used different sets of parameters for linear interpolation.
These included reflection coefficients, log area ratios and
autocorrelation coefficients of the all-pole filter. Stability
of the filter is preserved under interpolation in all three
cases. The different interpolation parameters resulted in
slight differences in the spectrum of the linear prediction
filter, but the quality of the synthesized speech as judged from
informal listening tests did not show any perceivable
differences. In view of the lack of difference in speech
quality between the different interpolation parameters, we used,
in many of our recent experiments, log area ratios for
interpolation since they were available directly from the
decoder.
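
As an illustration of the interpolation scheme described in this subsection, the sketch below interpolates pitch and energy logarithmically and log area ratios linearly between two decoded frames. It assumes the log area ratio definition g = log((1 + k)/(1 - k)) in natural-log units, so that k = tanh(g/2); the frame layout, function names and numerical values are hypothetical.

```python
import math

def lar_to_reflection(g):
    """Invert g = log((1 + k) / (1 - k)); the result always satisfies |k| < 1."""
    return math.tanh(g / 2.0)

def reflection_to_lar(k):
    return math.log((1.0 + k) / (1.0 - k))

def lerp(a, b, t):
    """Linear interpolation between a (t = 0) and b (t = 1)."""
    return a + t * (b - a)

def log_lerp(a, b, t):
    """Logarithmic interpolation: linear in log(a), log(b); requires a, b > 0."""
    return math.exp(lerp(math.log(a), math.log(b), t))

def interpolate_frame(prev, nxt, t):
    """Interpolate between two decoded frames (dicts with 'pitch', 'energy', 'lar')."""
    lar = [lerp(g0, g1, t) for g0, g1 in zip(prev["lar"], nxt["lar"])]
    return {
        "pitch": log_lerp(prev["pitch"], nxt["pitch"], t),
        "energy": log_lerp(prev["energy"], nxt["energy"], t),
        "lar": lar,
        "k": [lar_to_reflection(g) for g in lar],
    }

# Illustrative frames only.
frame_a = {"pitch": 100.0, "energy": 1.0, "lar": [reflection_to_lar(0.9), reflection_to_lar(-0.5)]}
frame_b = {"pitch": 400.0, "energy": 100.0, "lar": [reflection_to_lar(0.2), reflection_to_lar(0.7)]}
midway = interpolate_frame(frame_a, frame_b, 0.5)
```

Because any finite log area ratio maps back to a reflection coefficient of magnitude less than one, the interpolated filter remains stable for every interpolation weight.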

X. SIMULATION OF SPEECH COMPRESSION SYSTEM

In  previous  sections  we  frequently  alluded  to  our 
simulation  experiments.  We  briefly  describe  below  some  of  the 
details  of  our  simulated  speech  compression  system.  We  also 
give  typical  transmission  rates  when  using  the  different 
techniques  discussed  above  separately  and  in  different 
combinations.  A result  of  significant  importance  is  that  when 
we  incorporated  all  our  bit-saving  techniques,  we  obtained  good 
quality  speech  at  average  rates  of  1500  bps. 

A. Software Simulation

We simulated the entire speech compression system with its
many different variations on our time-sharing computer facility
comprising two PDP-10 computers called System A and System B.
All computations were done on System A using 36-bit word
floating point arithmetic. The analog-to-digital converter
(ADC) and the digital-to-analog converter (DAC) were located in
System B. Digitizing speech from a tape and recording (or
listening to) the synthesized speech were done on System B in
single user mode so as to provide the fast service rate needed
for these real-time functions. Sampled speech files and
synthesized speech files were transferred between the two
computer systems using the ARPA Network link. In all the
developmental and simulation phases of our work, we heavily used
the IMLAC PDS-1 display facility, which can be operated from
either of the two systems.

The synthesized speech was passed through a 12-bit DAC and
low-pass filtered sharply at 5 kHz before presenting it to
listeners or recording it on tape.

Due  to  limitations  of  our  pitch  extraction  scheme, 
occasionally (less than 1% of the time) we encountered pitch
errors  (mainly  pitch  doubling  errors,  especially  for 
high-pitched  female  speakers).  Since  we  were  principally 
interested  in  testing  many  of  the  quantization  and  encoding 
methods,  we  hand  edited  these  errors  to  prevent  the  pitch 
discrepancies  from  biasing  the  listening  judgments.  The  overall 
time  taken  for  analysis  and  synthesis  in  our  simulation  system 
was about 50-60 times real time.

B.  Typical  Transmission  Rates 

For the data given below, we used p=11 for fixed order
linear prediction and a maximum p=11 for variable order linear
prediction. Pitch and gain were quantized using 6 bits and 5
bits respectively. Log area ratios were quantized using the
data given in Table I. For the fixed order case this required,
before further encoding, 41 bits/frame for unvoiced sounds and
43 bits/frame for voiced sounds. The log likelihood ratio
threshold used for the variable frame rate system was 1.5 dB.
When variable wordlength encoding was used, we used Huffman
coding and delta encoding as mentioned in Section VIII. Several
different speech compression systems and their average
transmission rates are given in Table II.

Referring to Table II, the quality of the synthesized
speech differed little between systems 1, 2 and 3, and between
5, 6 and 7. Speech quality of system 5 was only slightly lower
than that of system 4, in spite of a reduction in the
transmission rate by a factor of about 2.5. This illustrates the
successful performance of our variable frame rate transmission
scheme. Although the bit rate of system 5 is lower than that of
system 1 by about 20%, speech quality was found to be actually
better for system 5 than for system 1. This suggests that
starting with a higher analysis rate and transmitting only when
necessary produces a better dynamic modeling of speech from the
point of view of perception.

Using system 6 in Table II, we processed a tape containing
an 11-sentence dialogue provided by the Stockholm Speech
Communication Laboratory to the participants of the 1974
Stockholm Speech Conference. The dialogue, which is between a
female telephone operator (pitch range 108-417 Hz) and a male
customer (pitch range 67-323 Hz), provides difficult test
material for any vocoder. Using this dialogue, we demonstrated
good quality synthesized dialogue at 1650 bps at an ARPA Network
Speech Compression (NSC) meeting and at two other conferences
[31,32].

XI. REAL TIME IMPLEMENTATION

In the second year of our speech compression project, we
worked  in  cooperation  with  the  other  sites  in  the  ARPA  community 
towards  implementation  of  a linear  predictive  vocoder  that 
transmits  speech  in  real  time  over  the  ARPA  Network.  This  work 
has  not  yet  been  completed.  In  this  section,  we  summarize  the 
work  we  have  done  thus  far. 

A.  Signal  Processing  System 

To  date,  our  effort  in  the  development  of  a signal 
processing  system  for  implementing  the  vocoder  has  been  largely 
a matter  of  system  definition  and  information  exchange.  In 
defining  the  system,  we  have  considered  the  needs  of  both  the 
speech  compression  project  and  the  speech  understanding  project. 
The  requirements  of  these  projects  indicate  that  the  system  will 
have three distinct purposes: (a) To function as a real-time
data acquisition system, supporting ADC's and DAC's at sampling
rates  up  to  20  kHz,  and  providing  a means  of  storing  and 
retrieving  speech  utterances;  (b)  To  allow  the  implementation 
of  real-time  speech  analysis  and  synthesis,  and  to  make  possible 
the  transfer  of  coded  speech  over  the  ARPA  Network;  (c)  To 
provide  signal  processing  computational  power  to  multiple  users, 
functioning  as  a peripheral  to  our  TENEX  time-sharing  system  and 
serving to remove some of the computing load from it.

In cooperation with the other sites involved in these two
projects, we have been investigating various items of hardware
and software with which to implement the system. This
cooperation has resulted in the network-wide selection of the
SPS-41 as the signal processing computer, one of the most
critical parts of the system. This cooperation has also given
the sites involved considerable leverage with the
manufacturer, Signal Processing Systems (SPS), resulting in
significant hardware and software improvements to the original
version of the SPS-41. These improvements included a dual-port
memory option and the availability of a double-precision
autocorrelation routine. Our complete signal processing system
comprising the SPS-41 and PDP-11 will be interfaced to the ARPA
Network.

We have disseminated information as it became available
from the manufacturers, particularly SPS, and exchanged
information  with  the  other  sites  in  order  to  avoid  duplication 
of  effort  and  ensure  maximum  compatibility. 

Support  software  for  the  SPS-41 

In close cooperation with the Information Sciences
Institute, we have been working on an on-line loader system for
the SPS-41. The software package supplied by SPS includes only
an off-line loader which is unsuitable for real-time
applications.

The on-line system consists of two parts, the Overlay
Executive (EXEC) and the Automatic Reformatter (ARF). The EXEC
is an SPS-41 program which loads information from the PDP-11
into the SPS-41. ARF reformats the output of the SPS-41
assembler in a way acceptable to the EXEC. It also provides a
mechanism for attaching meaningful labels to SPS-41 program
segments and locations. A user's guide for ARF is currently
being prepared.

B.  Variable  Speed  Speech 

In  the  implementation  of  the  receiving  end  or  synthesizer 
of  the  speech  compression  system,  it  is  necessary  to  have  a 
buffer  whose  size  will  depend  on  the  expected  maximum  delay  in 
the  network  transmission  of  the  vocoded  bit  stream.  However, 
there might be times when the expected maximum delay is
exceeded.  In  such  cases,  the  buffer  at  the  synthesizer  will  be 
empty  for  some  period  of  time  till  the  data  appears  again.  The 
data  might  then  arrive  at  such  a rate  that  the  buffer  overflows. 
This  is  an  undesirable  situation  since  speech  will  be  lost.  One 
solution  to  this  problem  is  to  speed  up  the  rate  of  speech 
synthesis  until  the  condition  is  normal  again.  Also,  before  the 
buffer  runs  out  of  data,  it  might  be  desirable  to  slow  down  the 
rate  of  synthesis  until  the  data  arrives.  However,  this 
involves  predicting  in  advance  the  occurrence  of  the  excessive 
delay.  We  have  already  demonstrated  the  feasibility  of  variable 
speed synthesis in an NSC meeting (May 1974). This method
involved merely redefining the duration of the synthesis frame
appropriately.  The  effectiveness  and  acceptability  of  this 
method  to  ameliorate  the  consequences  of  excessive  network 
delays  should  be  tested. 
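
A rough sketch of how redefining the synthesis frame duration could be tied to the state of the receive buffer is given below. The report states only that the frame duration was redefined; the buffer-driven policy, the limit on the speed change, and all names and numbers here are hypothetical.

```python
def synthesis_frame_samples(nominal_samples, speed):
    """Samples synthesized per frame; speed > 1 shortens frames (faster speech)."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return max(1, round(nominal_samples / speed))

def choose_speed(buffered_frames, target_frames, max_change=0.25):
    """Hypothetical policy: speed up above the target fill level, slow down below it."""
    error = (buffered_frames - target_frames) / max(target_frames, 1)
    # Limit the change so the speech rate never departs too far from normal.
    return 1.0 + max(-max_change, min(max_change, error))

# Illustrative case: 200-sample nominal frames, target of 10 buffered frames.
fast = synthesis_frame_samples(200, choose_speed(20, 10))   # buffer nearly overflowing
slow = synthesis_frame_samples(200, choose_speed(5, 10))    # buffer running low
```

Whether listeners accept such rate changes is exactly the open question raised above, so any policy of this kind would need to be validated by listening tests.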

XII. MISCELLANEOUS TOPICS

Some additional issues that we have investigated are
reported  in  this  section. 

A.  Measures  for  Objective  Evaluation  of  Speech  Quality 

The  following  two  reasons  motivated  us  to  develop  measures 
for  objective  evaluation  of  speech  quality. 

1.  Evaluation  of  speech  quality  has  been  done  mostly  through 
subjective  listening  tests.  It  would  be  desirable  to  develop 
objective  measures  that  correlate  well  with  the  scores  obtained 
in  subjective  listening  tests.  Besides  their  theoretical 
appeal,  these  measures  would  ensure  uniformity  in  evaluation  as 
well  as  enable  the  evaluation  to  be  done  by  computer.  Also, 
they  can  be  used  in  the  design  of  better  speech  quality 
vocoders.  While  there  exist  methods  in  the  literature  for 
objectively  evaluating  the  intelligibility  of  speech  in  the 
presence of stationary noise [33], little has been done
regarding  the  objective  evaluation  of  either  the  intelligibility 
or  the  quality  of  vocoded  speech. 

2. In many of our experiments, we specifically observed that
a change in speech quality due to any one improvement in
quantization, interpolation, etc., was most often not
perceivable, while when several such improvements were added
together there was a clearly perceivable improvement in speech
quality. It would be helpful therefore to incorporate some
objective measures of speech quality within the speech
compression system, which would generate performance scores and
hence enable one to make relative judgments of the smaller
differences.

For  our  linear  predictive  speech  compression  system  we  used 
the  two  measures,  (i)  log-area-ratio  error  measure  and 
(ii)  spectral  error  measure,  to  determine,  for  a given  speech 
frame,  the  deviation  between  the  synthesized  speech  and  the 
original  speech.  Based  on  the  error  data  computed  for  a large 
number  of  speech  sounds,  appropriate  measures  could  be  developed 
for  speech  quality  evaluation.  Below  we  describe  how  we 
computed  the  error  data  in  our  simulation  system.  Speech 
parameters  were  extracted  at  a rate  of  200  frames/sec  to  provide 
the  reference  data  with  enough  resolution  in  time.  They  were 
quantized  and  transmitted  at  a lower  rate  required  by  the 
specific  compression  system  under  evaluation.  After  decoding, 
the  parameters  were  interpolated  to  produce  the  test  data  of  200 
frames/sec.  The  log  area  ratio  error  was  obtained  as  the 
average  of  the  absolute  differences  between  the  sets  of  log  area 
ratios  from  the  reference  and  test  data.  The  spectral  error  was 
computed  by  averaging  in  frequency  the  absolute  differences 
between the values of the log spectra of the linear predictor
with  coefficients  from  the  reference  and  test  data.  For  each 
measure, the time history of the error within a speech
utterance, the time-averaged value of the error and its
variance, and the maximum error observed could all be
potentially useful in the quality evaluation. We have not yet had an
opportunity to compare these scores with formal
subjective evaluation results. However, by investigating the
measures (i) and (ii) for several linear predictive vocoders and
diverse speech utterances, we obtained the following two
results.

(a) The error (log area ratio or spectral) due to
interpolation was much larger than the error due to
quantization. This result was used to interpret the
quality difference between time-synchronous and
pitch-synchronous methods of synthesis (Section IX). An
important inference suggested by the result is that better
parameter interpolation approaches than the simple linear
scheme should be developed.

(b) Scatter plots between the spectral and the log area ratio
error measures obtained using different quantization step
sizes indicated a fairly strong linear relationship
between the two measures. Fig. 18 shows the scatter plot
obtained from 4 speech utterances and using 4 quantization
step sizes (0.5, 1, 2, 3 dB). The coefficient of correlation
for the least squares linear fit also shown in Fig. 18 was
0.88. A useful implication of this result is that the log
area ratio error measure can be substituted wherever the
spectral error measure is needed, which is computationally
advantageous. As an example, we cite the use of the log
area ratio measure in variable frame rate transmission for
detecting changes in the speech spectrum.

Fig. 18. Scatter plot of log area ratio error versus spectral error. The
linear characteristic passing through the origin is the best
least squares fit of a linear relationship between the two measures.
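
The two per-frame error measures described in this subsection can be sketched as follows. The sketch assumes natural-log log area ratios g = log((1 + k)/(1 - k)), an inverse filter A(z) = 1 + a_1 z^-1 + ... + a_p z^-p obtained by the step-up recursion, and a spectral error that ignores the gain term and is evaluated on a uniform frequency grid; these conventions and the function names are assumptions rather than details taken from the report.

```python
import cmath
import math

def log_area_ratios(k):
    return [math.log((1.0 + ki) / (1.0 - ki)) for ki in k]

def step_up(k):
    """Reflection coefficients -> predictor polynomial [1, a_1, ..., a_p]."""
    a = [1.0]
    for m, km in enumerate(k, start=1):
        a = [1.0] + [a[j] + km * a[m - j] for j in range(1, m)] + [km]
    return a

def log_spectrum_db(k, n_freq=64):
    """Log magnitude of 1/A(e^jw) in dB at n_freq frequencies between 0 and pi."""
    a = step_up(k)
    spectrum = []
    for i in range(n_freq):
        w = math.pi * (i + 0.5) / n_freq
        a_w = sum(c * cmath.exp(-1j * w * j) for j, c in enumerate(a))
        spectrum.append(-20.0 * math.log10(abs(a_w)))
    return spectrum

def lar_error(k_ref, k_test):
    """Average absolute difference between reference and test log area ratios."""
    g_ref, g_test = log_area_ratios(k_ref), log_area_ratios(k_test)
    return sum(abs(x - y) for x, y in zip(g_ref, g_test)) / len(g_ref)

def spectral_error_db(k_ref, k_test, n_freq=64):
    """Average over frequency of the absolute log spectral difference."""
    s_ref = log_spectrum_db(k_ref, n_freq)
    s_test = log_spectrum_db(k_test, n_freq)
    return sum(abs(x - y) for x, y in zip(s_ref, s_test)) / n_freq

k_reference = [0.8, -0.4, 0.3]      # illustrative reference frame
k_decoded = [0.78, -0.37, 0.33]     # illustrative quantized/interpolated frame
e_lar = lar_error(k_reference, k_decoded)
e_spec = spectral_error_db(k_reference, k_decoded)
```

The log area ratio error needs only p logarithms per frame, whereas the spectral error requires evaluating the filter response at many frequencies, which is the computational advantage noted in result (b).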

B.  Variable  Sampling  Frequencies 

It  is  often  desirable  to  be  able  to  test  the  performance  of 
a speech  compression  system  at  different  sampling  rates,  without 
actually  sampling  at  all  those  rates.  This  can  be  achieved  by 
applying  our  recently  developed  method  of  selective  linear 
prediction [24] to speech sampled at only the highest desirable
rate. Briefly, in this method, the spectral matching properties
of  the  autocorrelation  method  [6,7]  are  used  to  model  a selected 
portion  of  the  speech  spectrum  by  an  all-pole  spectrum.  The 
method  is  based  on  the  idea  that  the  autocorrelation 
coefficients  in  (8)  can  be  computed  from  the  spectrum  instead  of 
the  signal.  Thus,  the  speech  signal  is  sampled  at  the  highest 
rate  and  the  short-time  spectrum  is  computed  for  every  frame. 
Different  sampling  rates  can  then  be  simulated  by  computing  the 
autocorrelation  coefficients  from  the  part  of  the  spectrum 
corresponding  to  each  sampling  rate.  The  problems  of  sharp 
filtering and down sampling that exist in time domain methods
can therefore be avoided by simply working in the frequency
domain.
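
The idea of computing autocorrelation coefficients from a selected part of the spectrum can be sketched as below. This is a simplified illustration, not the method of [24] in detail: it assumes a plain DFT power spectrum of a zero-padded frame, keeps bins 0 through a chosen cutoff bin, treats that band as the full range 0 to pi, and applies an inverse cosine transform. All names and the test signal are hypothetical.

```python
import cmath
import math

def power_spectrum(frame, n_fft):
    """|X[k]|^2 for k = 0..n_fft/2 of the zero-padded frame (plain DFT)."""
    padded = list(frame) + [0.0] * (n_fft - len(frame))
    spec = []
    for k in range(n_fft // 2 + 1):
        x_k = sum(x * cmath.exp(-2j * math.pi * k * n / n_fft)
                  for n, x in enumerate(padded))
        spec.append(abs(x_k) ** 2)
    return spec

def selective_autocorr(spec, k_hi, order):
    """Autocorrelation R(0..order) from bins 0..k_hi, mapped onto a full band."""
    m = 2 * k_hi
    # Even-symmetric spectrum of length m built from the selected band only.
    full = spec[:k_hi + 1] + spec[k_hi - 1:0:-1]
    return [sum(p * math.cos(2.0 * math.pi * i * j / m) for j, p in enumerate(full)) / m
            for i in range(order + 1)]

# Illustrative 16-sample frame, zero-padded to 32 points.
frame = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(16)]
spec = power_spectrum(frame, 32)
r_full = selective_autocorr(spec, 16, 4)   # whole band: original sampling rate
r_half = selective_autocorr(spec, 8, 4)    # lower half band: half the sampling rate
```

When the whole band is selected, the result equals the ordinary time-domain autocorrelation of the frame, which gives a simple check on the procedure; selecting a narrower band yields the coefficients that an all-pole model at the lower sampling rate would be fitted to.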

C.  Formant  Bandwidth  Correction 

When the spectral characteristics of the synthesized speech
were compared with those of the original speech, it was observed
that occasionally the bandwidths of the formants of the
synthesized speech were relatively large. This was also
manifested in the time waveform as rapidly decaying sinusoids.
Such increases in bandwidths were found to occur mainly in
nasals and nasalized vowels. This phenomenon is perhaps due to
the limitations of the linear prediction method which assumes an
all-pole filter. Thus, a pole-zero-pole cluster in a nasal
sound may get represented by a wide bandwidth formant in the
linear prediction method. Synthesis experiments were conducted
where, when necessary, bandwidth corrections were made after
interpolation of the synthesizer parameters. The biggest
problem that was encountered was the need to identify the
ordered formants. Any error in such identification was found to
introduce undesirable "blips" in the synthesized speech. Even
when the formants were determined accurately, informal listening
tests did not indicate any significant improvement in quality by
the bandwidth correction.


D. Parameter Smoothing

A study of the time series of analysis parameters (energy,
pitch and reflection coefficients) indicated that occasionally
some parameters changed rather rapidly, possibly contributing to
roughness or "unevenness" that was sometimes present in
the synthesized speech. These rapid variations can be smoothed
out by proper low-pass filtering. We used a three-point
smoothing filter with weights 0.25, 0.50 and 0.25. Smoothing was
done just prior to interpolation at the synthesizer. Genuine
jumps in parameter values were preserved by not smoothing in
transitions between voiced and unvoiced sounds, and when
parameter changes exceeded preset thresholds (10 Hz for pitch
and 3 dB for energy). Identical low-pass filters were used for
smoothing pitch, energy and reflection coefficients (or log area
ratios).

Several  synthesis  experiments  were  performed  using  smoothed 
parameters.  Informal  listening  tests  showed  that  with 
smoothing,  speech  quality  improved  in  some  instances  in  that  the 
"unevenness"  observed  without  smoothing  disappeared.  But, 
smoothing  also  made  the  synthesized  speech  sound  less  "crispy" 
or  more  "smeared"  in  many  instances. 
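
The smoothing rule described above can be sketched as follows. The three-point weights and the two exceptions (voicing transitions and changes above a preset threshold) come from the text; the exact way the exceptions are tested, the treatment of the end points, and the function name are assumptions.

```python
def smooth_track(values, voiced, max_jump):
    """Three-point (0.25, 0.50, 0.25) smoothing that leaves genuine jumps untouched."""
    out = list(values)                      # end points are left unsmoothed (assumption)
    for i in range(1, len(values) - 1):
        same_voicing = voiced[i - 1] == voiced[i] == voiced[i + 1]
        small_steps = (abs(values[i] - values[i - 1]) <= max_jump
                       and abs(values[i + 1] - values[i]) <= max_jump)
        if same_voicing and small_steps:
            out[i] = 0.25 * values[i - 1] + 0.5 * values[i] + 0.25 * values[i + 1]
    return out

# Illustrative pitch track in Hz with the 10 Hz threshold quoted in the text.
pitch = [100.0, 104.0, 100.0, 100.0]
all_voiced = [True, True, True, True]
smoothed = smooth_track(pitch, all_voiced, max_jump=10.0)
```

A small ripple such as the 4 Hz excursion above is attenuated, while a step larger than the threshold or a voiced/unvoiced boundary passes through unchanged, which matches the intent of preserving genuine jumps.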

XIII. CONCLUSIONS

For  linear  predictive  speech  compression  systems,  we  have 
developed  many  methods  of  reducing  the  redundancy  in  the  speech 
signal  while  maintaining  good  speech  quality  at  the  synthesis. 
Included  among  these  methods  are  preemphasis  of  the  incoming 
speech,  adaptive  optimal  selection  of  predictor  order,  optimal 
selection  and  quantization  of  transmission  parameters,  variable 
frame  rate  transmission,  optimal  encoding,  and  improved 
synthesis  methodology.  When  we  incorporated  all  of  these  in  a 
floating point simulation of a linear predictive vocoder, we
obtained  synthesized  speech  with  high  quality  at  transmission 
rates  as  low  as  1500  bps. 

Speech  quality  would  perhaps  decrease  when  the  vocoder  is 
implemented  in  real  time  using  relatively  small  wordlength 
computers  or  hardware.  This  would  necessitate  development  of 
still  other  methods  of  improving  speech  quality.  We  feel  that 
such  improvements  can  be  achieved  by  developing  interpolation 
schemes  which  track  the  dynamic  behavior  of  speech  more  closely. 